Architecture
site-context-pipeline is a one-way pipeline that turns local input
files into local output files. It does not run as a service, does not
hold state between commands, and never reaches the network from its
core code. The intended deployment is a developer's laptop or a CI
runner.
High-level flow
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Local inputs     │ ──► │ Providers        │ ──► │ Normalised data  │
│ (CSV / JSON / MD)│     │ (optional)       │     │ artifacts (JSON) │
└──────────────────┘     └──────────────────┘     └────────┬─────────┘
                                                           │
                                                           ▼
                                                  ┌──────────────────┐
                                                  │ Context pack     │
                                                  │ (JSON + Markdown)│
                                                  └────────┬─────────┘
                                                           │
                                                           ▼
                                               Human review → downstream
                                                LLM-assisted workflows
Every box on the left is a file the user owns. Every box in the middle is plain Python on plain dicts. The right-hand side is also files — JSON and Markdown — that any tool can read.
Components
src/site_context_pipeline/
├── cli.py            argparse entry point; one JSON payload per command
├── clients.py        ClientPaths, init_client, JSON I/O helpers
├── inventory.py      classify URLs into page types, write content_inventory.json
├── link_graph.py     join inventory + edges, write internal_link_graph.json
├── context_pack.py   aggregate everything → agent_context_pack.{json,md}
├── markdown.py       tiny Markdown rendering helpers (no third-party deps)
├── schemas.py        dataclasses (InventoryItem, LinkNode, KeywordMetric, ...)
└── providers/
    ├── base.py                         abstract base classes + error types + result helpers
    ├── registry.py                     KEYWORD_PROVIDERS, SEARCH_PERFORMANCE_PROVIDERS
    ├── local_keyword_csv.py            live, offline
    ├── local_search_console_csv.py     live, offline
    ├── google_ads_keyword_planner.py   stub → not_configured
    └── google_search_console.py        stub → not_configured
The providers/ package is the only place the toolkit allows
vendor-specific code. Everything outside providers/ reads and writes
generic shapes (InventoryItem, LinkNode, KeywordMetric) and never
mentions a search vendor.
Why a one-way pipeline?
The data flow goes from inputs to outputs and never loops back. This is deliberate:
- Re-runnable. If an upstream input changes, the user re-runs the affected step. No state to invalidate.
- Composable. Each step's output is a file the next step reads. Users can replace any step with a script of their own as long as it emits the documented JSON shape.
- Auditable. Every artifact carries a source or classification_reason field so a reviewer can trace a fact back to the file that produced it.
- Test-friendly. Tests run each step against a tmp_path workspace and check the generated files. No mocks for external services because there are no external services.
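A minimal sketch of the file-in, file-out shape described in the list above. The function name build_inventory_step, the CSV column, and the classification rule are hypothetical and exist only for illustration; the real builders may look different. The standard-library tempfile module stands in for pytest's tmp_path fixture so the example runs on its own.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_inventory_step(input_csv: Path, output_json: Path) -> dict:
    """Hypothetical step: read one local file, write one local file."""
    with input_csv.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    items = []
    for row in rows:
        is_blog = "/blog/" in row["url"]  # assumed rule, illustration only
        items.append(
            {
                "url": row["url"],
                "page_type": "blog" if is_blog else "other",
                # Lets a reviewer see why the page got its type.
                "classification_reason": "url_contains_blog" if is_blog else "default",
                # Lets a reviewer trace the row back to its input file.
                "source": input_csv.name,
            }
        )

    output_json.write_text(json.dumps(items, indent=2), encoding="utf-8")
    return {"ok": True, "written": str(output_json), "count": len(items)}


def demo() -> list:
    # Throwaway workspace, no mocks: run the step, then read the generated file.
    with tempfile.TemporaryDirectory() as workspace:
        src = Path(workspace) / "urls.csv"
        dst = Path(workspace) / "content_inventory.json"
        src.write_text(
            "url\nhttps://example.com/blog/hello\nhttps://example.com/about\n",
            encoding="utf-8",
        )
        build_inventory_step(src, dst)
        return json.loads(dst.read_text(encoding="utf-8"))
```

Because the step keeps no state, re-running it after the input CSV changes simply overwrites the output file.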
Vendor neutrality
The pipeline is built around a hard rule: the core does not know which vendor produced the data.
This works because:
- Providers normalise external data into local files. A provider reads its source (today: a CSV; tomorrow: a vendor API) and emits a ProviderResult whose items are generic KeywordMetric rows. The CLI persists that to data/keyword_metrics.json or data/search_performance.json.
- context_pack.py reads normalised artifacts only. It opens data/keyword_metrics.json and treats every row as data. It does not branch on item["source"]. It does not import any provider. It does not even need the providers package on the import path to run.
- Vendor-specific names live in providers, never in the core. The schemas, the CLI verbs (build-context-pack, inspect, import-keywords, list-providers), the artifact filenames (keyword_metrics.json, search_performance.json) and the field names (avg_monthly_searches, impressions, clicks, ctr, position) are all vendor-neutral. A provider's identifier like google-ads may be vendor-specific by design; that is what tells the user which API the future live adapter will call.
The result: swapping google-ads for a hypothetical Yandex Wordstat
adapter, or for a community DataForSEO adapter, is a one-file change in
providers/. Nothing in the core needs to move.
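The separation above might look roughly like the following sketch. The names KeywordMetric, ProviderResult, KEYWORD_PROVIDERS, avg_monthly_searches and the not_configured error come from this page; everything else (the keyword field, the input column names, the local-keyword-csv identifier, the function signatures, and the registry being a plain dict) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class KeywordMetric:
    # Vendor-neutral row; the exact field set is assumed.
    keyword: str
    avg_monthly_searches: int
    source: str


@dataclass
class ProviderResult:
    ok: bool
    items: List[KeywordMetric] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


def local_keyword_rows(rows: List[dict]) -> ProviderResult:
    """Hypothetical offline provider: normalises already-parsed CSV rows."""
    items = [
        KeywordMetric(row["keyword"], int(row["searches"]), "local-keyword-csv")
        for row in rows
    ]
    return ProviderResult(ok=True, items=items)


def google_ads_stub(rows: List[dict]) -> ProviderResult:
    """Stub adapter: reports not_configured instead of raising."""
    return ProviderResult(ok=False, errors=["not_configured"])


# Assumed registry shape: provider identifier -> callable.
KEYWORD_PROVIDERS: Dict[str, Callable[[List[dict]], ProviderResult]] = {
    "local-keyword-csv": local_keyword_rows,
    "google-ads": google_ads_stub,
}


def total_search_volume(items: List[KeywordMetric]) -> int:
    """Core-side code: treats every row as data and never inspects `source`."""
    return sum(item.avg_monthly_searches for item in items)
```

Under these assumptions, adding another vendor means writing one more function and one more registry entry; total_search_volume does not change.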
Process model
Every CLI command:
- Resolves the workspace path and the client identifier (validating the latter against a strict regex).
- Calls one pure-Python builder function (e.g. build_inventory, build_context_pack).
- Writes a JSON payload to stdout describing what happened. With --write, the builder also touches the filesystem; without it, the command is a dry run.
- Exits with code 0 on success, 1 on failure. Failure is always represented in the JSON payload (ok: false, errors: [...]) before the process returns.
This means the CLI is safe to invoke from a CI matrix or a Makefile: the JSON output is always parseable, even on the failure path.
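The four steps above could be wired together along these lines. This is a sketch, not the project's cli.py: only the --write flag and the ok/errors payload keys come from this page, while the --workspace and --client flags, the regex, the invalid_client_id error string, and the builder signature are assumptions.

```python
import argparse
import json
import re
from typing import Callable, List

# Assumed pattern; the text only says the client identifier regex is strict.
CLIENT_ID_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,62}")


def run_command(argv: List[str], builder: Callable[[str, str, bool], dict]) -> int:
    """Parse arguments, call one builder, print one JSON payload, return the exit code."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("--workspace", default=".")  # hypothetical flag
    parser.add_argument("--client", default="")      # hypothetical flag
    parser.add_argument("--write", action="store_true")
    args = parser.parse_args(argv)

    if not CLIENT_ID_RE.fullmatch(args.client):
        payload = {"ok": False, "errors": ["invalid_client_id"]}
    else:
        try:
            payload = builder(args.workspace, args.client, args.write)
        except Exception as exc:
            # Failure is reported in the payload rather than as a traceback.
            payload = {"ok": False, "errors": [str(exc)]}

    print(json.dumps(payload))
    return 0 if payload.get("ok") else 1


def dry_run_builder(workspace: str, client: str, write: bool) -> dict:
    # Stand-in builder: a real one would touch the filesystem only when write is True.
    return {"ok": True, "client": client, "dry_run": not write, "errors": []}
```

A script entry point would then call sys.exit(run_command(sys.argv[1:], dry_run_builder)). One caveat of this sketch: argparse's own usage errors (for example an unknown flag) exit with status 2 and print to stderr, so a CLI that promises JSON on every path has to handle that case separately.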
Failure modes
The pipeline distinguishes three kinds of failure:
| Kind | Example | How it surfaces |
|---|---|---|
| Malformed input | CSV path does not exist | ProviderConfigurationError raised inside the builder, converted to ok: false + a single error string by the CLI. Exit code 1. |
| Adapter not configured | google-ads stub called without credentials | Adapter returns a ProviderResult with ok=False, errors=["not_configured"], no exception. Exit code 1. |
| Empty but legal | Inventory CSV with zero rows | Builder writes an empty content_inventory.json and records a warning (inventory_missing_or_empty). Exit code 0. |
The first two yield a non-zero exit code; the third does not. This keeps the CLI usable in pipelines where "0 rows" is a valid outcome (e.g. a brand-new client) but "your CSV is missing" is not.
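To make the three rows concrete, here is a small sketch that maps each kind of failure to a payload and an exit code. ProviderConfigurationError, not_configured and inventory_missing_or_empty are the names used in the table; the helper functions, the csv_not_found message, and the dict-based payload are hypothetical.

```python
from typing import Callable


class ProviderConfigurationError(Exception):
    """Stand-in for the error type named in the table."""


def run_step(step: Callable[[], dict]) -> dict:
    """Convert a raised configuration error into an ok: false payload."""
    try:
        return step()
    except ProviderConfigurationError as exc:
        return {"ok": False, "errors": [str(exc)], "warnings": []}


def missing_csv() -> dict:
    # Malformed input: the builder raises, the caller converts.
    raise ProviderConfigurationError("csv_not_found")


def stub_adapter() -> dict:
    # Adapter not configured: a failed result is returned, nothing is raised.
    return {"ok": False, "errors": ["not_configured"], "warnings": []}


def empty_inventory() -> dict:
    # Empty but legal: success with a warning attached.
    return {
        "ok": True,
        "errors": [],
        "warnings": ["inventory_missing_or_empty"],
        "items": [],
    }


def exit_code(payload: dict) -> int:
    return 0 if payload["ok"] else 1
```

In this sketch exit_code looks only at ok, so a warning never changes the exit status.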
What is not in scope
- No live HTTP from the core. No live HTTP from local-* providers. Live HTTP only ever appears in optional adapters under providers/, behind a pip install site-context-pipeline[<extra>] install gate.
- No mutable global state. Every command reads and writes through ClientPaths so two clients never share data by accident.
- No long-running processes. Every command runs to completion and exits.
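One way to get the per-client isolation described above is an immutable path object that derives every location from the workspace and the client identifier. The class below is a hypothetical stand-in, not the real ClientPaths from clients.py; the clients/<id>/ directory layout is assumed, and only the data/keyword_metrics.json and data/search_performance.json filenames come from this page.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ClientPathsSketch:
    """Hypothetical stand-in: all paths hang off one client's root directory."""

    workspace: Path
    client_id: str

    @property
    def root(self) -> Path:
        # Assumed layout: <workspace>/clients/<client_id>/
        return self.workspace / "clients" / self.client_id

    @property
    def keyword_metrics(self) -> Path:
        return self.root / "data" / "keyword_metrics.json"

    @property
    def search_performance(self) -> Path:
        return self.root / "data" / "search_performance.json"
```

Since the object is frozen and holds no module-level state, two commands for different clients resolve to disjoint directories without any shared configuration.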