Artifacts

This document describes every file the pipeline reads and writes under a client workspace, when each is generated, what it is for, and whether the rest of the pipeline depends on it.

Schemas. Every JSON artifact described below has a public JSON Schema 2020-12 contract shipped with the wheel. See JSON Schemas for the loader API, validation recipes, and versioning rules.

Workspace layout

clients/<client>/
├── input/                       owned by the user
│   ├── urls.csv                 inventory input (CSV or JSON)
│   ├── links.csv                edge list (CSV or JSON; optional)
│   └── project.md               free-form editorial notes (optional)
├── config/                      per-client overrides (optional)
│   ├── classifier.json          override classifier rules
│   └── commercial_urls.json     promote URLs to `landing`
├── data/                        generated by `build-*` and `import-*`
│   ├── content_inventory.json
│   ├── internal_link_graph.json
│   ├── keyword_metrics.json     (if `import-keywords` ran)
│   ├── search_performance.json  (if `import-search-performance` ran)
│   └── search_evidence.json     (if `import-search-evidence` ran)
├── output/                      generated by `build-context-pack`
│   ├── agent_context_pack.json
│   ├── agent_context_pack.md
│   └── content_opportunities.md
└── logs/                        reserved (currently unused by 0.1 core)

Generated artifacts

`data/content_inventory.json`


Generated by	`site-context-pipeline build-inventory --write`
Source	`--source PATH` (CSV, JSON, or sitemap XML URL list); use `--format` to force a reader.
Required by	`build-link-graph`, `build-context-pack`
Optional?	Required for the rest of the core pipeline

A list of objects, one per page, with the page type and the classification reason that fired:

[
  {
    "url": "https://example.com/blog/how-to-plan-delivery/",
    "path": "/blog/how-to-plan-delivery/",
    "page_type": "blog",
    "classification_reason": "matched_pattern:*/blog/*",
    "title": "How to plan a delivery",
    "h1": "How to plan a delivery",
    "status_code": 200,
    "word_count": 1100,
    "inlinks_count": 2,
    "outlinks_count": 3,
    "source": "csv"
  }
]

`page_type` is one of `home`, `service`, `blog`, `category`, `landing`, or `other`. `classification_reason` is a stable string token suitable for filtering and reporting.
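Because both fields are plain string tokens, a reviewer can tally them with a few lines of Python. The sketch below is illustrative only: `summarise_inventory` is a hypothetical helper, not part of the toolkit, and the sample list is a trimmed copy of the record shown above rather than a real `data/content_inventory.json`.

```python
import json
from collections import Counter

# Page types as listed in this document.
PAGE_TYPES = {"home", "service", "blog", "category", "landing", "other"}


def summarise_inventory(pages):
    """Count pages per page_type and per classification_reason.

    Hypothetical helper for illustration; not a toolkit API.
    """
    by_type = Counter()
    by_reason = Counter()
    for page in pages:
        page_type = page["page_type"]
        if page_type not in PAGE_TYPES:
            raise ValueError(f"unexpected page_type: {page_type!r}")
        by_type[page_type] += 1
        by_reason[page["classification_reason"]] += 1
    return by_type, by_reason


# Trimmed copy of the example record; in practice the list would come from
# json.load() on data/content_inventory.json.
sample_pages = json.loads("""
[
  {
    "url": "https://example.com/blog/how-to-plan-delivery/",
    "page_type": "blog",
    "classification_reason": "matched_pattern:*/blog/*"
  }
]
""")

by_type, by_reason = summarise_inventory(sample_pages)
print(dict(by_type))
print(dict(by_reason))
```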

`data/internal_link_graph.json`


Generated by	`site-context-pipeline build-link-graph --write`
Source	`--source PATH` (edge CSV or JSON) — falls back to `<client>/input/links.csv`
Read by	`build-context-pack` (used by the opportunities sections)
Optional?	Optional; the pack still builds without it but with fewer signals

Two flat lists (`nodes` and `edges`), two derived opportunity lists, and a `warnings` list:

{
  "nodes": [
    {
      "url": "https://example.com/services/local-delivery/",
      "page_type": "service",
      "inlink_count": 1,
      "outlink_count": 0,
      "blog_inlink_count": 1,
      "is_commercial_target": true
    }
  ],
  "edges": [
    {
      "source_url": "https://example.com/blog/how-to-plan-delivery/",
      "target_url": "https://example.com/services/local-delivery/",
      "anchor_text": "local delivery"
    }
  ],
  "commercial_pages_low_blog_inlinks": [],
  "blog_pages_low_inlinks": [],
  "warnings": []
}

If the user does not provide an edge list, the file is still written, but `edges` is empty and a `no_edges_in_input_using_inventory_counts_only` warning is recorded.
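The per-node counters in the example can be reproduced from the edge list. The following is a minimal sketch under assumptions: `count_links` is a hypothetical helper, every edge is counted once, and no deduplication or self-link filtering is applied. The pipeline's own counting rules may differ.

```python
def count_links(page_types, edges):
    """Derive per-URL link counters from an edge list.

    page_types maps URL -> page_type; edges is a list of dicts with
    source_url and target_url keys, as in the example above.
    Hypothetical helper; the pipeline's own rules may differ.
    """
    counts = {
        url: {"inlink_count": 0, "outlink_count": 0, "blog_inlink_count": 0}
        for url in page_types
    }
    for edge in edges:
        source, target = edge["source_url"], edge["target_url"]
        if source in counts:
            counts[source]["outlink_count"] += 1
        if target in counts:
            counts[target]["inlink_count"] += 1
            # An inlink counts as a blog inlink when its source is a blog page.
            if page_types.get(source) == "blog":
                counts[target]["blog_inlink_count"] += 1
    return counts


blog_url = "https://example.com/blog/how-to-plan-delivery/"
service_url = "https://example.com/services/local-delivery/"

link_counts = count_links(
    {blog_url: "blog", service_url: "service"},
    [{"source_url": blog_url, "target_url": service_url, "anchor_text": "local delivery"}],
)
print(link_counts[service_url])
```

With the single example edge, the service page ends up with one inlink, one blog inlink, and no outlinks, which matches the node shown above.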

`data/keyword_metrics.json`


Generated by	`site-context-pipeline import-keywords --provider <NAME> --source <PATH> --write`
Source	provider-specific input (e.g. CSV for `local-csv`)
Read by	`build-context-pack`
Optional?	Optional. Without it the pack records a `missing_keyword_data` warning.

A small envelope around a list of `KeywordMetric` rows. Every row carries `source` = `<provider_name>` so a reviewer can trace which adapter produced which row.

{
  "schema_version": 1,
  "provider": "local-csv",
  "items_count": 6,
  "metadata": {"source_path": "...", "row_count": 6, "items_count": 6},
  "warnings": [],
  "items": [
    {
      "query": "local delivery service",
      "source": "local-csv",
      "avg_monthly_searches": 3600,
      "competition": "HIGH",
      "geo": "US",
      "language": "en",
      "source_url": "https://example.com/services/local-delivery/",
      "raw": {}
    }
  ]
}

The full row schema is in `docs/providers.md`.
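A quick consistency check on the envelope can catch a hand-edited or truncated file before it reaches the pack. This sketch assumes only the fields visible in the example above; `check_envelope` is a hypothetical helper and is not a substitute for validating against the shipped JSON Schema.

```python
def check_envelope(envelope):
    """Return a list of human-readable problems found in an envelope.

    Checks two invariants suggested by the example above:
    items_count equals len(items), and every row's source equals provider.
    Hypothetical helper for illustration.
    """
    problems = []
    items = envelope.get("items", [])
    if envelope.get("items_count") != len(items):
        problems.append("items_count does not match len(items)")
    provider = envelope.get("provider")
    for index, row in enumerate(items):
        if row.get("source") != provider:
            problems.append(f"items[{index}].source does not match provider")
    return problems


good = {
    "schema_version": 1,
    "provider": "local-csv",
    "items_count": 1,
    "items": [{"query": "local delivery service", "source": "local-csv"}],
}
bad = {
    "schema_version": 1,
    "provider": "local-csv",
    "items_count": 2,
    "items": [{"query": "local delivery service", "source": "other-adapter"}],
}

print(check_envelope(good))
print(check_envelope(bad))
```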

`data/search_performance.json`


Generated by	`site-context-pipeline import-search-performance --provider <NAME> --source <PATH> --write`
Source	provider-specific input (e.g. Google Search Console export for `local-gsc-csv`)
Read by	`build-context-pack`
Optional?	Optional. Without it the pack omits the search-performance summary, weak-CTR list, and ranked-but-unsupported list.

Same envelope shape as `keyword_metrics.json`. Rows usually fill in `impressions`, `clicks`, `ctr`, and `position` instead of `avg_monthly_searches`.

`data/search_evidence.json` (optional)


Generated by	`site-context-pipeline import-search-evidence --provider <NAME> --source <PATH> --write`
Source	provider-specific input (e.g. a hand-curated CSV for `local-serp-csv`)
Read by	`build-context-pack`
Optional?	Optional. Without it the pack omits the "What competitors do" section.

A list of `SearchEvidence` rows, one per organic SERP hit you have captured. Schema fields: `query`, `rank`, `title`, `url`, `snippet`, `page_type`, plus a `raw` dict for unconsumed CSV columns.

Hand-curated CSVs are the standard input. The toolkit deliberately does not scrape live SERPs in 0.x.
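To illustrate how a hand-curated CSV could map onto those fields, here is a minimal sketch. It assumes the CSV column names match the schema field names exactly and that `rank` is an integer. The `rows_from_csv` helper, the `captured_on` column, and the sample values are invented for the example, and the real `local-serp-csv` adapter may behave differently.

```python
import csv
import io

# Schema fields named in this document; anything else goes into "raw".
KNOWN_FIELDS = ("query", "rank", "title", "url", "snippet", "page_type")


def rows_from_csv(text):
    """Parse CSV text into SearchEvidence-like dicts (hypothetical helper)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {name: record.get(name, "") for name in KNOWN_FIELDS}
        row["rank"] = int(row["rank"])  # assumption: rank is an integer
        # Columns the schema does not name are preserved, not dropped.
        row["raw"] = {
            key: value for key, value in record.items() if key not in KNOWN_FIELDS
        }
        rows.append(row)
    return rows


# Made-up sample; "captured_on" stands in for any extra column.
sample_csv = (
    "query,rank,title,url,snippet,page_type,captured_on\n"
    "local delivery service,1,Example title,https://example.org/delivery/,"
    "Example snippet,service,2024-01-01\n"
)

evidence_rows = rows_from_csv(sample_csv)
print(evidence_rows[0]["rank"], evidence_rows[0]["raw"])
```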

`output/agent_context_pack.json`


Generated by	`site-context-pipeline build-context-pack --write`
Source	`data/content_inventory.json`, optional `data/internal_link_graph.json`, optional `data/keyword_metrics.json`, optional `data/search_performance.json`, optional `data/search_evidence.json`, optional `input/project.md`
Read by	downstream LLM workflows, human review
Optional?	This is the primary output of the toolkit

A machine-readable digest with a stable schema (integer `schema_version`). It is designed to be the single document an LLM or a human reviewer reads before drafting a brief or a content change. Top-level keys:

Key	Description
`schema_version`	integer; bumped only when the shape changes
`generated_at`	ISO-8601 UTC timestamp
`client`	client identifier from the CLI
`summary`	counts: pages, edges, nodes, keyword rows, performance rows
`classification.reasons`	counts per classification reason
`pages`	pages grouped by `page_type` (`home`, `landing`, `service`, `category`, `blog`)
`opportunities`	derived lists (see below)
`search_performance_summary`	totals + impression-weighted average position + average CTR
`providers.{keyword_metrics,search_performance}`	provenance: which adapter produced the data
`project_notes`	verbatim contents of `input/project.md`
`sources`	absolute paths of the files this pack was built from
`warnings`	machine-readable token list (e.g. `missing_keyword_data:...`)
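The `search_performance_summary` figures can be approximated from raw performance rows. The sketch below makes two assumptions that this document does not confirm: average CTR is total clicks divided by total impressions (not the mean of per-row CTRs), and rows expose `impressions`, `clicks`, and `position`. `summarise_performance` is a hypothetical helper.

```python
def summarise_performance(rows):
    """Totals, impression-weighted average position, and average CTR.

    Hypothetical helper; the CTR definition here is an assumption.
    """
    total_impressions = sum(row["impressions"] for row in rows)
    total_clicks = sum(row["clicks"] for row in rows)
    if total_impressions == 0:
        # No impressions means neither average is defined.
        return {
            "impressions": 0,
            "clicks": total_clicks,
            "avg_position": None,
            "avg_ctr": None,
        }
    weighted_position = sum(row["position"] * row["impressions"] for row in rows)
    return {
        "impressions": total_impressions,
        "clicks": total_clicks,
        "avg_position": weighted_position / total_impressions,
        "avg_ctr": total_clicks / total_impressions,
    }


# Made-up rows: the second row has three times the impressions of the first,
# so it pulls the average position towards 6.0.
performance_rows = [
    {"impressions": 100, "clicks": 10, "position": 2.0},
    {"impressions": 300, "clicks": 6, "position": 6.0},
]

performance_summary = summarise_performance(performance_rows)
print(performance_summary)
```

Weighting by impressions keeps a query that was shown a handful of times from distorting the average as much as a query shown thousands of times.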

`opportunities` carries five lists:

`commercial_pages_low_blog_inlinks`: service/landing/category pages with zero blog inlinks.
`blog_pages_low_inlinks`: blog pages with at most one inlink.
`top_keywords`: keyword rows ranked by demand (or impressions if demand is missing). Empty when no keyword artifact exists.
`weak_ctr_pages`: query/page rows with ≥ 100 impressions and CTR ≤ 2%. Empty when no performance artifact exists.
`ranked_but_unsupported`: pages with best position ≤ 20 that receive zero inlinks or zero blog inlinks. Empty when no performance artifact exists.
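The thresholds above translate directly into filters. This is a sketch, not the toolkit's implementation: the function names are reused from the list keys for readability, `ctr` is assumed to be stored as a fraction (0.02 for 2%), and the sample data is made up.

```python
COMMERCIAL_TYPES = {"service", "landing", "category"}


def commercial_pages_low_blog_inlinks(nodes):
    """Commercial pages that no blog page links to."""
    return [
        node["url"]
        for node in nodes
        if node["page_type"] in COMMERCIAL_TYPES and node["blog_inlink_count"] == 0
    ]


def blog_pages_low_inlinks(nodes):
    """Blog pages with at most one inlink."""
    return [
        node["url"]
        for node in nodes
        if node["page_type"] == "blog" and node["inlink_count"] <= 1
    ]


def weak_ctr_rows(rows):
    """Rows with at least 100 impressions and CTR of 2% or less.

    Assumption: ctr is a fraction, so 2% is 0.02.
    """
    return [row for row in rows if row["impressions"] >= 100 and row["ctr"] <= 0.02]


nodes = [
    {"url": "/services/a/", "page_type": "service", "inlink_count": 1, "blog_inlink_count": 1},
    {"url": "/landing/b/", "page_type": "landing", "inlink_count": 2, "blog_inlink_count": 0},
    {"url": "/blog/c/", "page_type": "blog", "inlink_count": 0, "blog_inlink_count": 0},
    {"url": "/blog/d/", "page_type": "blog", "inlink_count": 3, "blog_inlink_count": 2},
]
rows = [
    {"query": "q1", "impressions": 500, "ctr": 0.01},
    {"query": "q2", "impressions": 50, "ctr": 0.0},
    {"query": "q3", "impressions": 200, "ctr": 0.05},
]

print(commercial_pages_low_blog_inlinks(nodes))
print(blog_pages_low_inlinks(nodes))
print([row["query"] for row in weak_ctr_rows(rows)])
```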

`output/agent_context_pack.md`


Generated by	`site-context-pipeline build-context-pack --write`
Source	the same data as the JSON pack, rendered as Markdown
Read by	humans, code review, LLM prompts
Optional?	Always written alongside the JSON pack

Human-readable mirror of the JSON pack. Sections appear conditionally: the Top keyword opportunities, Pages with impressions but weak CTR, Pages with rankings but weak internal support, and Search performance summary sections are emitted only when their underlying artifacts exist. Otherwise, a Missing keyword data line points to the `import-*` commands.

`output/content_opportunities.md`


Generated by	`site-context-pipeline build-context-pack --write`
Source	`output/agent_context_pack.json` (same in-memory build)
Read by	humans (review checklist)
Optional?	Always written alongside the pack

A deterministic shortlist of gaps: orphan blog posts, commercial pages with no blog support, weak-CTR queries, and ranked-but-unsupported pages. It is designed as a prompt for human review rather than a ranking.

Required vs optional, summarised

Artifact	Required for the pipeline to run?
`data/content_inventory.json`	yes — every other step reads it
`data/internal_link_graph.json`	optional — pack still builds, with fewer signals
`data/keyword_metrics.json`	optional — adds keyword opportunities to the pack
`data/search_performance.json`	optional — adds performance signals to the pack
`data/search_evidence.json`	optional — adds the "What competitors do" section to the pack
`output/agent_context_pack.json`	always written by `build-context-pack`
`output/agent_context_pack.md`	always written by `build-context-pack`
`output/content_opportunities.md`	always written by `build-context-pack`
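Before running `build-context-pack`, it can be handy to see which of these artifacts a workspace already contains. The sketch below uses only the relative paths listed in this document. `artifact_status` and `missing_required` are hypothetical helpers, and the temporary `clients/demo` workspace exists only for the demonstration.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the table above.
REQUIRED = ("data/content_inventory.json",)
OPTIONAL = (
    "data/internal_link_graph.json",
    "data/keyword_metrics.json",
    "data/search_performance.json",
    "data/search_evidence.json",
)


def artifact_status(workspace):
    """Map each known input artifact to whether it exists on disk."""
    root = Path(workspace)
    return {rel: (root / rel).is_file() for rel in REQUIRED + OPTIONAL}


def missing_required(workspace):
    """List required artifacts that are absent from the workspace."""
    status = artifact_status(workspace)
    return [rel for rel in REQUIRED if not status[rel]]


# Demonstration against a throwaway workspace holding only the inventory.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "clients" / "demo"
    (workspace / "data").mkdir(parents=True)
    (workspace / "data" / "content_inventory.json").write_text("[]", encoding="utf-8")
    status = artifact_status(workspace)
    still_missing = missing_required(workspace)

print(status)
print(still_missing)
```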
