site-context-pipeline¶

Convert website crawls, URL inventories, and editorial notes into structured context packs for human-reviewed, LLM-assisted content workflows.

site-context-pipeline is a small, dependency-free Python CLI that turns the boring-but-essential facts about a website — URL inventory, internal link graph, keyword data, search performance — into a stable, machine- and human-readable digest. The digest is the artifact you hand to a language model (or to a human writer) before they touch a brief or a draft.

What this site covers¶

:material-rocket-launch:{ .lg .middle } Tutorial

A 10-minute end-to-end walk-through. From a CSV (or sitemap) to a finished context pack, with explanations for every step.
:material-toolbox-outline:{ .lg .middle } Recipes

Concrete workflows: onboarding a new site, quarterly audits, pre-rebrand snapshots, gating drafts in CI, handing the pack to an LLM with citations.
:material-compare:{ .lg .middle } How this compares

Honest comparison vs Screaming Frog, Sitebulb, ContentKing, and rolling your own script. Where this toolkit is the right answer and where it isn't.
:material-storefront-outline:{ .lg .middle } Demo clients

Three synthetic workspaces shipped under examples/: a small services site, a coffee-equipment storefront, and a three-language docs site. Concrete starting points that exercise different IAs.
:material-sitemap:{ .lg .middle } Architecture

The one-way pipeline: local inputs → providers → normalised artifacts → context pack. Why the core never reaches the network.
:material-database-import:{ .lg .middle } Providers

The provider abstraction, the four safety rules, and reference docs for every shipped provider (local-csv, local-gsc-csv, local-serp-csv, plus the google-ads and google-search-console stubs).
:material-table-of-contents:{ .lg .middle } Provider reference

A uniform per-provider reference: config keys, input columns, failure modes, rate limits, and a worked example for each shipped provider — and the template for adding your own.
:material-file-document-outline:{ .lg .middle } Artifacts

Every file the pipeline writes: when it is generated, the command that produces it, and whether it is required or optional.
:material-shape:{ .lg .middle } Classifier

The config/classifier.json schema: priorities, exclude patterns, allow-lists, and the warnings the inventory emits when a rule is invalid.
:material-clipboard-check-outline:{ .lg .middle } QA

The deterministic Markdown-draft QA module. Nine offline checks, structured JSON output, exit code 1 on red findings so CI can gate on them.
:material-shield-check-outline:{ .lg .middle } JSON Schemas

Public JSON Schema 2020-12 contracts for every artifact, shipped with the wheel. Stable contract for LLM consumers, CI gating, and code generation in any language.
:material-road:{ .lg .middle } Roadmap

What landed in 0.x and what is planned for 0.4. Every live API adapter is opt-in behind an extra; the base install never grows runtime dependencies.
:material-history:{ .lg .middle } Changelog

Every release with what was added, changed, and fixed.

Install¶

pip install site-context-pipeline

Requires Python ≥ 3.11. The base install has zero runtime dependencies.

Sixty-second demo¶

site-context-pipeline init --client demo --write
site-context-pipeline build-inventory --client demo \
    --source examples/demo-client/input/urls.csv --write
site-context-pipeline build-link-graph --client demo \
    --source examples/demo-client/input/links.csv --write
site-context-pipeline build-context-pack --client demo --write
site-context-pipeline inspect --client demo

After that, clients/demo/output/agent_context_pack.md is the digest a reviewer (or an LLM) reads before drafting anything.

Project at a glance¶

Offline by default. The 0.x core is standard-library only. Live API adapters, when they ship, sit behind optional extras.
Deterministic. Same input, same output. Every artifact records where each fact came from in a sources map.
Vendor-neutral core. Vendor-specific names (e.g. google-ads) live only on provider identifiers, never in core schemas, artifact field names, or CLI verbs.
Human-review first. The pack is designed for a human reviewer; LLM consumption is a side benefit.

For the full README and contribution guide, see the GitHub repository.