site-context-pipeline
Convert website crawls, URL inventories, and editorial notes into structured context packs for human-reviewed, LLM-assisted content workflows.
site-context-pipeline is a small, dependency-free Python CLI that
turns the boring-but-essential facts about a website — URL inventory,
internal link graph, keyword data, search performance — into a stable,
machine- and human-readable digest. The digest is the artifact you
hand to a language model (or to a human writer) before they touch a
brief or a draft.
What this site covers
-
:material-rocket-launch:{ .lg .middle } Tutorial
A 10-minute end-to-end walk-through. From a CSV (or sitemap) to a finished context pack, with explanations for every step.
-
:material-toolbox-outline:{ .lg .middle } Recipes
Concrete workflows: onboarding a new site, quarterly audits, pre-rebrand snapshots, gating drafts in CI, handing the pack to an LLM with citations.
-
:material-compare:{ .lg .middle } How this compares
Honest comparison vs Screaming Frog, Sitebulb, ContentKing, and rolling your own script. Where this toolkit is the right answer and where it isn't.
-
:material-storefront-outline:{ .lg .middle } Demo clients
Three synthetic workspaces shipped under examples/: a small services site, a coffee-equipment storefront, and a three-language docs site. Concrete starting points that exercise different information architectures (IAs).
-
:material-sitemap:{ .lg .middle } Architecture
The one-way pipeline: local inputs → providers → normalised artifacts → context pack. Why the core never reaches the network.
-
:material-database-import:{ .lg .middle } Providers
The provider abstraction, the four safety rules, and reference docs for every shipped provider (local-csv, local-gsc-csv, local-serp-csv, plus the google-ads and google-search-console stubs).
-
:material-table-of-contents:{ .lg .middle } Provider reference
A uniform per-provider reference: config keys, input columns, failure modes, rate limits, and a worked example for each shipped provider — and the template for adding your own.
-
:material-file-document-outline:{ .lg .middle } Artifacts
Every file the pipeline writes: when it is generated, the command that produces it, and whether it is required or optional.
-
:material-shape:{ .lg .middle } Classifier
The config/classifier.json schema: priorities, exclude patterns, allow-lists, and the warnings the inventory emits when a rule is invalid.
-
:material-clipboard-check-outline:{ .lg .middle } QA
The deterministic Markdown-draft QA module. Nine offline checks, structured JSON output, exit code 1 on red findings so CI can gate on them.
-
:material-shield-check-outline:{ .lg .middle } JSON Schemas
Public JSON Schema 2020-12 contracts for every artifact, shipped with the wheel. Stable contract for LLM consumers, CI gating, and code generation in any language.
-
:material-road:{ .lg .middle } Roadmap
What landed in 0.x and what is planned for 0.4. Every live API adapter is opt-in behind an extra; the base install never grows runtime dependencies.
-
:material-history:{ .lg .middle } Changelog
Every release with what was added, changed, and fixed.
Install
Requires Python ≥ 3.11. The base install has zero runtime dependencies.
Sixty-second demo
site-context-pipeline init --client demo --write
site-context-pipeline build-inventory --client demo \
--source examples/demo-client/input/urls.csv --write
site-context-pipeline build-link-graph --client demo \
--source examples/demo-client/input/links.csv --write
site-context-pipeline build-context-pack --client demo --write
site-context-pipeline inspect --client demo
After that, clients/demo/output/agent_context_pack.md is the digest a
reviewer (or an LLM) reads before drafting anything.
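A quick way to sanity-check the pack before handing it on is to list its headings. The sketch below is illustrative only: it assumes the pack uses ATX-style Markdown headings (lines starting with #), and the helper name pack_headings is hypothetical, not part of the CLI.

```python
import re
from pathlib import Path

# Path taken from the demo above; adjust for other clients.
PACK_PATH = Path("clients/demo/output/agent_context_pack.md")

# Built indirectly so this example can itself live inside a fenced block.
FENCE = "`" * 3

# Assumption: headings are ATX-style, one to six '#' followed by whitespace.
HEADING_RE = re.compile(r"#{1,6}\s+\S")


def pack_headings(markdown_text: str) -> list[str]:
    """Return heading lines in order, ignoring anything inside code fences."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence
            continue
        if not in_fence and HEADING_RE.match(stripped):
            headings.append(stripped)
    return headings


if __name__ == "__main__":
    if PACK_PATH.exists():
        for heading in pack_headings(PACK_PATH.read_text(encoding="utf-8")):
            print(heading)
    else:
        print(f"{PACK_PATH} not found; run the demo commands first.")
```

An empty result, or a missing file, usually means one of the build steps above did not run with --write.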
Project at a glance
- Offline by default. The 0.x core is standard-library only. Live API adapters, when they ship, sit behind optional extras.
- Deterministic. Same input, same output. Every artifact records where each fact came from in a sources map.
- Vendor-neutral core. Vendor-specific names (e.g. google-ads) live only on provider identifiers, never in core schemas, artifact field names, or CLI verbs.
- Human-review first. The pack is designed for a human reviewer; LLM consumption is a side benefit.
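The determinism claim can be spot-checked from outside the tool by hashing an output directory, rebuilding from the same inputs, and hashing again. This is a minimal sketch, not part of the toolkit: digest_tree is a hypothetical helper, and it assumes the artifacts are byte-for-byte reproducible, with no embedded timestamps.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> str:
    """Fold every file under root, in sorted order, into one SHA-256 digest."""
    hasher = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so a renamed file changes the digest too.
        hasher.update(path.relative_to(root).as_posix().encode("utf-8"))
        hasher.update(b"\0")
        hasher.update(path.read_bytes())
        hasher.update(b"\0")
    return hasher.hexdigest()


if __name__ == "__main__":
    # Output directory from the demo; run this before and after a rebuild.
    output_dir = Path("clients/demo/output")
    if output_dir.is_dir():
        print(digest_tree(output_dir))
    else:
        print(f"{output_dir} not found; run the demo commands first.")
```

If the two digests differ for identical inputs, comparing the individual files shows which artifact changed.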
For the full README and contribution guide, see the GitHub repository.