Artifacts¶
This document describes every file the pipeline reads and writes under a client workspace, when each is generated, what it is for, and whether the rest of the pipeline depends on it.
Schemas. Every JSON artifact described below has a public JSON Schema 2020-12 contract shipped with the wheel. See JSON Schemas for the loader API, validation recipes, and versioning rules.
Workspace layout¶
clients/<client>/
├── input/ owned by the user
│ ├── urls.csv inventory input (CSV or JSON)
│ ├── links.csv edge list (CSV or JSON; optional)
│ └── project.md free-form editorial notes (optional)
├── config/ per-client overrides (optional)
│ ├── classifier.json override classifier rules
│ └── commercial_urls.json promote URLs to `landing`
├── data/ generated by `build-*` and `import-*`
│ ├── content_inventory.json
│ ├── internal_link_graph.json
│ ├── keyword_metrics.json (if `import-keywords` ran)
│ ├── search_performance.json (if `import-search-performance` ran)
│ └── search_evidence.json (if `import-search-evidence` ran)
├── output/ generated by `build-context-pack`
│ ├── agent_context_pack.json
│ ├── agent_context_pack.md
│ └── content_opportunities.md
└── logs/ reserved (currently unused by 0.1 core)
Generated artifacts¶
data/content_inventory.json¶
| Generated by | site-context-pipeline build-inventory --write |
| Source | --source PATH (CSV, JSON, or sitemap XML URL list); use --format to force a reader. |
| Required by | build-link-graph, build-context-pack |
| Optional? | Required for the rest of the core pipeline |
A list of objects, one per page, with the page type and the classification reason that fired:
[
  {
    "url": "https://example.com/blog/how-to-plan-delivery/",
    "path": "/blog/how-to-plan-delivery/",
    "page_type": "blog",
    "classification_reason": "matched_pattern:*/blog/*",
    "title": "How to plan a delivery",
    "h1": "How to plan a delivery",
    "status_code": 200,
    "word_count": 1100,
    "inlinks_count": 2,
    "outlinks_count": 3,
    "source": "csv"
  }
]
page_type is one of home, service, blog, category, landing,
other. classification_reason is a stable string token suitable for
filtering and reporting.
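Because classification_reason is a stable token, it can be grouped and counted directly. The following is a minimal sketch, not part of the toolkit: the helper name summarise_inventory, the second blog URL, and the `matched_pattern:*/services/*` token are hypothetical, and the sample is inlined so the snippet runs without a workspace.

```python
import json
from collections import Counter

# Inline sample in the shape shown above. A real run would instead read
# clients/<client>/data/content_inventory.json.
SAMPLE_INVENTORY = """
[
  {"url": "https://example.com/blog/how-to-plan-delivery/",
   "page_type": "blog",
   "classification_reason": "matched_pattern:*/blog/*"},
  {"url": "https://example.com/blog/packing-tips/",
   "page_type": "blog",
   "classification_reason": "matched_pattern:*/blog/*"},
  {"url": "https://example.com/services/local-delivery/",
   "page_type": "service",
   "classification_reason": "matched_pattern:*/services/*"}
]
"""


def summarise_inventory(pages):
    """Count pages per page_type and per classification_reason."""
    by_type = Counter(page["page_type"] for page in pages)
    by_reason = Counter(page["classification_reason"] for page in pages)
    return by_type, by_reason


pages = json.loads(SAMPLE_INVENTORY)
by_type, by_reason = summarise_inventory(pages)
print(dict(by_type))
print(dict(by_reason))
```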
data/internal_link_graph.json¶
| Generated by | site-context-pipeline build-link-graph --write |
| Source | --source PATH (edge CSV or JSON) — falls back to <client>/input/links.csv |
| Required by | build-context-pack (used by the opportunities sections) |
| Optional? | Optional; the pack still builds without it but with fewer signals |
Two flat lists (nodes and edges), two derived opportunity lists, and a warnings list:
{
  "nodes": [
    {
      "url": "https://example.com/services/local-delivery/",
      "page_type": "service",
      "inlink_count": 1,
      "outlink_count": 0,
      "blog_inlink_count": 1,
      "is_commercial_target": true
    }
  ],
  "edges": [
    {
      "source_url": "https://example.com/blog/how-to-plan-delivery/",
      "target_url": "https://example.com/services/local-delivery/",
      "anchor_text": "local delivery"
    }
  ],
  "commercial_pages_low_blog_inlinks": [],
  "blog_pages_low_inlinks": [],
  "warnings": []
}
If the user does not provide an edge list, the file is still written
but edges is empty and a no_edges_in_input_using_inventory_counts_only
warning is recorded.
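A consumer of this file may want to know whether the counts come from a real edge list before trusting them. The sketch below shows one way to check, assuming the warning appears in warnings as a string that starts with the token named above; the function names are hypothetical and not part of the toolkit.

```python
NO_EDGES_TOKEN = "no_edges_in_input_using_inventory_counts_only"


def has_edge_data(graph):
    """True when edges are present and the no-edges warning was not recorded."""
    warnings = graph.get("warnings", [])
    no_edges_warning = any(w.startswith(NO_EDGES_TOKEN) for w in warnings)
    return bool(graph.get("edges")) and not no_edges_warning


def commercial_targets_without_blog_inlinks(graph):
    """URLs of commercial-target nodes whose blog_inlink_count is zero."""
    return [
        node["url"]
        for node in graph.get("nodes", [])
        if node.get("is_commercial_target") and node.get("blog_inlink_count", 0) == 0
    ]


# Hypothetical graph built without an edge list.
graph_without_edges = {
    "nodes": [
        {
            "url": "https://example.com/services/local-delivery/",
            "page_type": "service",
            "inlink_count": 1,
            "outlink_count": 0,
            "blog_inlink_count": 0,
            "is_commercial_target": True,
        }
    ],
    "edges": [],
    "commercial_pages_low_blog_inlinks": [],
    "blog_pages_low_inlinks": [],
    "warnings": [NO_EDGES_TOKEN],
}

print(has_edge_data(graph_without_edges))
print(commercial_targets_without_blog_inlinks(graph_without_edges))
```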
data/keyword_metrics.json¶
| Generated by | site-context-pipeline import-keywords --provider <NAME> --source <PATH> --write |
| Source | provider-specific input (e.g. CSV for local-csv) |
| Read by | build-context-pack |
| Optional? | Optional. Without it the pack records a missing_keyword_data warning. |
A small envelope around a list of KeywordMetric rows. Every row
carries source = <provider_name> so a reviewer can trace which
adapter produced which row.
{
  "schema_version": 1,
  "provider": "local-csv",
  "items_count": 6,
  "metadata": {"source_path": "...", "row_count": 6, "items_count": 6},
  "warnings": [],
  "items": [
    {
      "query": "local delivery service",
      "source": "local-csv",
      "avg_monthly_searches": 3600,
      "competition": "HIGH",
      "geo": "US",
      "language": "en",
      "source_url": "https://example.com/services/local-delivery/",
      "raw": {}
    }
  ]
}
The full row schema is in docs/providers.md.
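The envelope makes two consistency properties easy to check by hand: items_count should match the length of items, and every row's source should equal provider. A minimal sketch of such a check follows; check_envelope and the problem tokens it returns are hypothetical names for illustration, and the shipped JSON Schemas remain the authoritative contract.

```python
def check_envelope(envelope):
    """Return a list of problem tokens; an empty list means the checks passed."""
    problems = []
    items = envelope.get("items", [])
    if envelope.get("items_count") != len(items):
        problems.append("items_count_mismatch")
    provider = envelope.get("provider")
    for index, row in enumerate(items):
        # Every row is expected to carry source = <provider_name>.
        if row.get("source") != provider:
            problems.append(f"row_{index}_source_mismatch")
    return problems


good = {
    "schema_version": 1,
    "provider": "local-csv",
    "items_count": 1,
    "warnings": [],
    "items": [{"query": "local delivery service", "source": "local-csv"}],
}
bad = {
    "schema_version": 1,
    "provider": "local-csv",
    "items_count": 2,
    "warnings": [],
    "items": [{"query": "local delivery service", "source": "other"}],
}

print(check_envelope(good))
print(check_envelope(bad))
```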
data/search_performance.json¶
| Generated by | site-context-pipeline import-search-performance --provider <NAME> --source <PATH> --write |
| Source | provider-specific input (e.g. Google Search Console export for local-gsc-csv) |
| Read by | build-context-pack |
| Optional? | Optional. Without it the pack omits the search-performance summary, weak-CTR list, and ranked-but-unsupported list. |
Same envelope shape as keyword_metrics.json. Rows usually fill in
impressions, clicks, ctr, position instead of
avg_monthly_searches.
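These fields are the raw material for a performance summary such as the impression-weighted average position reported in the context pack. The sketch below illustrates that kind of calculation under stated assumptions: every row carries impressions, clicks, and position, and the overall CTR is taken as total clicks divided by total impressions, which may differ from how the toolkit computes its average CTR. The function name is hypothetical.

```python
def summarise_performance(rows):
    """Totals plus impression-weighted position and overall CTR.

    Assumes each row has numeric impressions, clicks, and position.
    """
    impressions = sum(row["impressions"] for row in rows)
    clicks = sum(row["clicks"] for row in rows)
    if impressions == 0:
        # No impressions: the weighted averages are undefined.
        return {"impressions": 0, "clicks": clicks, "avg_position": None, "ctr": None}
    # Each row's position counts in proportion to its impressions.
    weighted_position = sum(row["position"] * row["impressions"] for row in rows) / impressions
    return {
        "impressions": impressions,
        "clicks": clicks,
        "avg_position": weighted_position,
        "ctr": clicks / impressions,
    }


sample_rows = [
    {"query": "local delivery service", "impressions": 100, "clicks": 5, "position": 4.0},
    {"query": "delivery planning", "impressions": 300, "clicks": 3, "position": 12.0},
]
summary = summarise_performance(sample_rows)
print(summary)
```

A plain mean of the two positions would be 8.0; weighting by impressions pulls the figure toward the higher-volume query.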
data/search_evidence.json (optional)¶
| Generated by | site-context-pipeline import-search-evidence --provider <NAME> --source <PATH> --write |
| Source | provider-specific input (e.g. a hand-curated CSV for local-serp-csv) |
| Read by | build-context-pack |
| Optional? | Optional. Without it the pack omits the "What competitors do" section. |
A list of SearchEvidence rows (one per organic SERP hit you have
captured). Schema fields: query, rank, title, url, snippet,
page_type, plus a raw dict for unconsumed CSV columns.
Hand-curated CSVs are the standard input. The toolkit deliberately does not scrape live SERPs in 0.x.
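To make the row shape concrete, the sketch below reads a hand-curated CSV into dictionaries with the fields listed above and collects any other columns under raw. It is an illustration only: read_evidence, the sample row, and the extra note column are hypothetical, and the actual local-serp-csv adapter may parse and validate differently.

```python
import csv
import io

KNOWN_FIELDS = ("query", "rank", "title", "url", "snippet", "page_type")


def read_evidence(handle):
    """Parse CSV rows into dicts; columns outside KNOWN_FIELDS go into raw."""
    rows = []
    for record in csv.DictReader(handle):
        row = {name: record.get(name) or "" for name in KNOWN_FIELDS}
        # rank is numeric in this sketch; a blank cell becomes None.
        row["rank"] = int(row["rank"]) if row["rank"] else None
        row["raw"] = {
            key: value
            for key, value in record.items()
            if key is not None and key not in KNOWN_FIELDS
        }
        rows.append(row)
    return rows


# Hypothetical hand-curated input with one unconsumed column ("note").
SAMPLE_CSV = (
    "query,rank,title,url,snippet,page_type,note\n"
    "local delivery service,1,Same-day delivery,"
    "https://competitor.example/delivery/,Fast local delivery.,service,checked manually\n"
)

evidence = read_evidence(io.StringIO(SAMPLE_CSV))
print(evidence[0]["rank"], evidence[0]["raw"])
```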
output/agent_context_pack.json¶
| Generated by | site-context-pipeline build-context-pack --write |
| Source | data/content_inventory.json, data/internal_link_graph.json, optional data/keyword_metrics.json, optional data/search_performance.json, optional data/search_evidence.json, input/project.md |
| Read by | downstream LLM workflows, human review |
| Optional? | This is the primary output of the toolkit |
Machine-readable digest. Stable schema (schema_version integer),
designed to be the single document an LLM (or a human reviewer) reads
before drafting a brief or a content change. Top-level keys:
| Key | Description |
|---|---|
| schema_version | integer; bumped only when the shape changes |
| generated_at | ISO-8601 UTC timestamp |
| client | client identifier from the CLI |
| summary | counts: pages, edges, nodes, keyword rows, performance rows |
| classification.reasons | counts per classification reason |
| pages | pages grouped by page_type (home, landing, service, category, blog) |
| opportunities | derived lists (see below) |
| search_performance_summary | totals + impression-weighted average position + average CTR |
| providers.{keyword_metrics,search_performance} | provenance: which adapter produced the data |
| project_notes | verbatim contents of input/project.md |
| sources | absolute paths of the files this pack was built from |
| warnings | machine-readable token list (e.g. missing_keyword_data:...) |
opportunities carries five lists:
- commercial_pages_low_blog_inlinks: service/landing/category pages with zero blog inlinks.
- blog_pages_low_inlinks: blog pages with at most one inlink.
- top_keywords: keyword rows ranked by demand (or impressions if demand is missing). Empty when no keyword artifact exists.
- weak_ctr_pages: query/page rows with ≥ 100 impressions and CTR ≤ 2%. Empty when no performance artifact exists.
- ranked_but_unsupported: pages with best position ≤ 20 that receive zero inlinks or zero blog inlinks. Empty when no performance artifact exists.
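Two of these rules are simple threshold filters and can be sketched in a few lines. This is an illustration of the stated thresholds rather than the toolkit's code: the function and constant names are hypothetical, ctr is assumed to be stored as a fraction (0.02 for 2%), and the node fields follow the internal_link_graph.json example.

```python
WEAK_CTR_MIN_IMPRESSIONS = 100
WEAK_CTR_MAX_CTR = 0.02  # assumption: ctr is a fraction, so 2% is 0.02


def weak_ctr_rows(rows):
    """Rows with at least 100 impressions and a CTR of at most 2%."""
    return [
        row
        for row in rows
        if row["impressions"] >= WEAK_CTR_MIN_IMPRESSIONS and row["ctr"] <= WEAK_CTR_MAX_CTR
    ]


def blog_pages_low_inlinks(nodes):
    """URLs of blog nodes that receive at most one inlink."""
    return [
        node["url"]
        for node in nodes
        if node["page_type"] == "blog" and node["inlink_count"] <= 1
    ]


performance_rows = [
    {"query": "local delivery service", "impressions": 500, "ctr": 0.01},
    {"query": "delivery planning", "impressions": 50, "ctr": 0.0},
    {"query": "same day courier", "impressions": 200, "ctr": 0.05},
]
nodes = [
    {"url": "https://example.com/blog/how-to-plan-delivery/", "page_type": "blog", "inlink_count": 1},
    {"url": "https://example.com/blog/packing-tips/", "page_type": "blog", "inlink_count": 4},
    {"url": "https://example.com/services/local-delivery/", "page_type": "service", "inlink_count": 0},
]

weak = weak_ctr_rows(performance_rows)
orphans = blog_pages_low_inlinks(nodes)
print([row["query"] for row in weak])
print(orphans)
```

The second performance row is excluded despite its zero CTR because it falls below the impression floor, which keeps low-volume noise out of the list.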
output/agent_context_pack.md¶
| Generated by | site-context-pipeline build-context-pack --write |
| Source | the same data as the JSON pack, rendered as Markdown |
| Read by | humans, code review, LLM prompts |
| Optional? | Always written alongside the JSON pack |
Human-readable mirror of the JSON pack. Sections appear conditionally:
the Top keyword opportunities, Pages with impressions but weak CTR,
Pages with rankings but weak internal support, and Search
performance summary sections are emitted only when their underlying
artifacts exist. When an artifact is absent, a Missing keyword data
line points the reader to the import-* commands instead.
output/content_opportunities.md¶
| Generated by | site-context-pipeline build-context-pack --write |
| Source | output/agent_context_pack.json (same in-memory build) |
| Read by | humans (review checklist) |
| Optional? | Always written alongside the pack |
Deterministic shortlist of gaps: orphan blog posts, commercial pages with no blog support, weak-CTR queries, ranked-but-unsupported pages. Designed to be a human review prompt rather than a ranking.
Required vs optional, summarised¶
| Artifact | Required for the pipeline to run? |
|---|---|
| data/content_inventory.json | yes; every other step reads it |
| data/internal_link_graph.json | optional; pack still builds, with fewer signals |
| data/keyword_metrics.json | optional; adds keyword opportunities to the pack |
| data/search_performance.json | optional; adds performance signals to the pack |
| data/search_evidence.json | optional; adds the "What competitors do" section to the pack |
| output/agent_context_pack.json | always written by build-context-pack |
| output/agent_context_pack.md | always written by build-context-pack |
| output/content_opportunities.md | always written by build-context-pack |
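The table can be turned into a quick pre-flight check before running build-context-pack. The sketch below is an assumption-laden illustration, not a toolkit command: artifact_status is a hypothetical helper, and the demo builds a throwaway workspace in a temporary directory so it runs anywhere.

```python
import tempfile
from pathlib import Path

REQUIRED = ["data/content_inventory.json"]
OPTIONAL = [
    "data/internal_link_graph.json",
    "data/keyword_metrics.json",
    "data/search_performance.json",
    "data/search_evidence.json",
]


def artifact_status(workspace):
    """Report which required artifacts are missing and which optional ones exist."""
    workspace = Path(workspace)
    missing_required = [p for p in REQUIRED if not (workspace / p).is_file()]
    present_optional = [p for p in OPTIONAL if (workspace / p).is_file()]
    return {
        "ready": not missing_required,
        "missing_required": missing_required,
        "present_optional": present_optional,
    }


# Demo: a temporary workspace holding only the required inventory.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "clients" / "demo"
    (workspace / "data").mkdir(parents=True)
    (workspace / "data" / "content_inventory.json").write_text("[]")
    status = artifact_status(workspace)
    empty_status = artifact_status(Path(tmp) / "clients" / "missing")

print(status)
print(empty_status)
```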