Artifacts

This document describes every file the pipeline reads and writes under a client workspace, when each is generated, what it is for, and whether the rest of the pipeline depends on it.

Schemas. Every JSON artifact described below has a public JSON Schema 2020-12 contract shipped with the wheel. See JSON Schemas for the loader API, validation recipes, and versioning rules.

Workspace layout

clients/<client>/
├── input/                       owned by the user
│   ├── urls.csv                 inventory input (CSV or JSON)
│   ├── links.csv                edge list (CSV or JSON; optional)
│   └── project.md               free-form editorial notes (optional)
├── config/                      per-client overrides (optional)
│   ├── classifier.json          override classifier rules
│   └── commercial_urls.json     promote URLs to `landing`
├── data/                        generated by `build-*` and `import-*`
│   ├── content_inventory.json
│   ├── internal_link_graph.json
│   ├── keyword_metrics.json     (if `import-keywords` ran)
│   ├── search_performance.json  (if `import-search-performance` ran)
│   └── search_evidence.json     (if `import-search-evidence` ran)
├── output/                      generated by `build-context-pack`
│   ├── agent_context_pack.json
│   ├── agent_context_pack.md
│   └── content_opportunities.md
└── logs/                        reserved (currently unused by 0.1 core)

Generated artifacts

data/content_inventory.json

Generated by site-context-pipeline build-inventory --write
Source --source PATH (CSV, JSON, or sitemap XML URL list); use --format to force a reader.
Required by build-link-graph, build-context-pack
Optional? No. The rest of the core pipeline requires it.

A list of objects, one per page, with the page type and the classification reason that fired:

[
  {
    "url": "https://example.com/blog/how-to-plan-delivery/",
    "path": "/blog/how-to-plan-delivery/",
    "page_type": "blog",
    "classification_reason": "matched_pattern:*/blog/*",
    "title": "How to plan a delivery",
    "h1": "How to plan a delivery",
    "status_code": 200,
    "word_count": 1100,
    "inlinks_count": 2,
    "outlinks_count": 3,
    "source": "csv"
  }
]

page_type is one of home, service, blog, category, landing, other. classification_reason is a stable string token suitable for filtering and reporting.
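Because both fields are plain string tokens, reporting on an inventory takes only a few lines. The sketch below is a minimal illustration, not part of the toolkit: `summarize_inventory` is a hypothetical helper, and only the blog reason token comes from the example above (the `*/services/*` token is made up for the sample).

```python
import json
from collections import Counter

# The six page types listed in this document.
PAGE_TYPES = {"home", "service", "blog", "category", "landing", "other"}

# Illustrative rows in the shape of data/content_inventory.json.
# Only the blog reason token appears in this document; the other is made up.
SAMPLE = """
[
  {"url": "https://example.com/blog/how-to-plan-delivery/", "page_type": "blog",
   "classification_reason": "matched_pattern:*/blog/*"},
  {"url": "https://example.com/blog/packing-tips/", "page_type": "blog",
   "classification_reason": "matched_pattern:*/blog/*"},
  {"url": "https://example.com/services/local-delivery/", "page_type": "service",
   "classification_reason": "matched_pattern:*/services/*"}
]
"""


def summarize_inventory(rows):
    """Return (pages per type, pages per reason, URLs with an unknown page_type)."""
    by_type = Counter(row["page_type"] for row in rows)
    by_reason = Counter(row["classification_reason"] for row in rows)
    unknown = [row["url"] for row in rows if row["page_type"] not in PAGE_TYPES]
    return by_type, by_reason, unknown


by_type, by_reason, unknown = summarize_inventory(json.loads(SAMPLE))
print(dict(by_type))    # {'blog': 2, 'service': 1}
print(dict(by_reason))
print(unknown)          # []
```

In a real workspace the rows would come from `json.load` on `clients/<client>/data/content_inventory.json` instead of the inline sample.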

data/internal_link_graph.json

Generated by site-context-pipeline build-link-graph --write
Source --source PATH (edge CSV or JSON) — falls back to <client>/input/links.csv
Read by build-context-pack (used by the opportunities sections)
Optional? Optional; the pack still builds without it but with fewer signals

Two flat lists (nodes and edges), two derived opportunity lists, and a warnings list:

{
  "nodes": [
    {
      "url": "https://example.com/services/local-delivery/",
      "page_type": "service",
      "inlink_count": 1,
      "outlink_count": 0,
      "blog_inlink_count": 1,
      "is_commercial_target": true
    }
  ],
  "edges": [
    {
      "source_url": "https://example.com/blog/how-to-plan-delivery/",
      "target_url": "https://example.com/services/local-delivery/",
      "anchor_text": "local delivery"
    }
  ],
  "commercial_pages_low_blog_inlinks": [],
  "blog_pages_low_inlinks": [],
  "warnings": []
}

If the user does not provide an edge list, the file is still written but edges is empty and a no_edges_in_input_using_inventory_counts_only warning is recorded.
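To make the relationship between edges and the per-node counters concrete, here is a minimal sketch that derives `inlink_count`, `outlink_count`, and `blog_inlink_count` from an edge list. It is an assumption about how such counts could be computed, not the pipeline's actual code: `derive_node_counts` is a hypothetical helper, and real rules for duplicates, self-links, or URLs outside the inventory may differ.

```python
BLOG = "https://example.com/blog/how-to-plan-delivery/"
SERVICE = "https://example.com/services/local-delivery/"


def derive_node_counts(page_types, edges):
    """Build one node dict per known URL and count links from the edge list.

    page_types maps url -> page_type (as found in the inventory).
    Edges that point at unknown URLs are ignored in this sketch.
    """
    nodes = {
        url: {"url": url, "page_type": page_type,
              "inlink_count": 0, "outlink_count": 0, "blog_inlink_count": 0}
        for url, page_type in page_types.items()
    }
    for edge in edges:
        source, target = edge["source_url"], edge["target_url"]
        if source in nodes:
            nodes[source]["outlink_count"] += 1
        if target in nodes:
            nodes[target]["inlink_count"] += 1
            # A blog inlink is an inlink whose source page is a blog page.
            if page_types.get(source) == "blog":
                nodes[target]["blog_inlink_count"] += 1
    return nodes


page_types = {BLOG: "blog", SERVICE: "service"}
edges = [{"source_url": BLOG, "target_url": SERVICE, "anchor_text": "local delivery"}]

nodes = derive_node_counts(page_types, edges)
service_node = nodes[SERVICE]
print(service_node["inlink_count"], service_node["outlink_count"],
      service_node["blog_inlink_count"])  # 1 0 1
```

With the single edge from the example, the service node ends up with the same counters shown in the JSON above.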

data/keyword_metrics.json

Generated by site-context-pipeline import-keywords --provider <NAME> --source <PATH> --write
Source provider-specific input (e.g. CSV for local-csv)
Read by build-context-pack
Optional? Optional. Without it the pack records a missing_keyword_data warning.

A small envelope around a list of KeywordMetric rows. Every row carries source = <provider_name> so a reviewer can trace which adapter produced which row.

{
  "schema_version": 1,
  "provider": "local-csv",
  "items_count": 6,
  "metadata": {"source_path": "...", "row_count": 6, "items_count": 6},
  "warnings": [],
  "items": [
    {
      "query": "local delivery service",
      "source": "local-csv",
      "avg_monthly_searches": 3600,
      "competition": "HIGH",
      "geo": "US",
      "language": "en",
      "source_url": "https://example.com/services/local-delivery/",
      "raw": {}
    }
  ]
}

The full row schema is in docs/providers.md.
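A reviewer can sanity-check an envelope before trusting it. The sketch below is illustrative only: `check_envelope` is a hypothetical helper, and the rule that `items_count` equals the length of `items` is an assumption drawn from the field name. The rule that every row's `source` matches the provider comes from the description above.

```python
import json

# A trimmed envelope in the shape shown above (one row instead of six).
ENVELOPE = json.loads("""
{
  "schema_version": 1,
  "provider": "local-csv",
  "items_count": 1,
  "warnings": [],
  "items": [
    {"query": "local delivery service", "source": "local-csv",
     "avg_monthly_searches": 3600, "competition": "HIGH"}
  ]
}
""")


def check_envelope(envelope):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    items = envelope.get("items", [])
    # Assumption: items_count mirrors the number of rows in items.
    if envelope.get("items_count") != len(items):
        problems.append("items_count does not match len(items)")
    # Documented: every row carries source = <provider_name>.
    for index, row in enumerate(items):
        if row.get("source") != envelope.get("provider"):
            problems.append(f"items[{index}].source differs from provider")
    return problems


problems = check_envelope(ENVELOPE)
print(problems)  # []

# A row attributed to a different adapter is flagged.
tampered = dict(ENVELOPE, items=[dict(ENVELOPE["items"][0], source="other")])
tampered_problems = check_envelope(tampered)
print(tampered_problems)
```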

data/search_performance.json

Generated by site-context-pipeline import-search-performance --provider <NAME> --source <PATH> --write
Source provider-specific input (e.g. Google Search Console export for local-gsc-csv)
Read by build-context-pack
Optional? Optional. Without it the pack omits the search-performance summary, weak-CTR list, and ranked-but-unsupported list.

Same envelope shape as keyword_metrics.json. Rows usually fill in impressions, clicks, ctr, and position instead of avg_monthly_searches.

data/search_evidence.json (optional)

Generated by site-context-pipeline import-search-evidence --provider <NAME> --source <PATH> --write
Source provider-specific input (e.g. a hand-curated CSV for local-serp-csv)
Read by build-context-pack
Optional? Optional. Without it the pack omits the "What competitors do" section.

A list of SearchEvidence rows (one per organic SERP hit you have captured). Schema fields: query, rank, title, url, snippet, page_type, plus a raw dict for unconsumed CSV columns.

Hand-curated CSVs are the standard input. The toolkit deliberately does not scrape live SERPs in 0.x.
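The split between schema fields and the raw dict can be sketched with the standard `csv` module. This is an illustration under assumptions, not the adapter's implementation: `parse_evidence` is a hypothetical helper, the CSV column names are assumed to match the schema field names, and `captured_on` is a made-up extra column.

```python
import csv
import io

# Schema fields named in this document; anything else lands in "raw".
KNOWN_FIELDS = ("query", "rank", "title", "url", "snippet", "page_type")

# Hand-curated sample; "captured_on" is a hypothetical extra column.
SAMPLE_CSV = """query,rank,title,url,snippet,page_type,captured_on
local delivery service,1,Same-day delivery,https://competitor.example/delivery/,Fast local delivery.,service,2024-05-01
"""


def parse_evidence(csv_text):
    """Turn CSV text into row dicts with known fields plus a raw dict."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = {name: record.get(name, "") for name in KNOWN_FIELDS}
        # rank is the only numeric field; keep None when the cell is empty.
        row["rank"] = int(row["rank"]) if row["rank"] else None
        # Columns the schema does not consume are preserved untouched.
        row["raw"] = {key: value for key, value in record.items()
                      if key not in KNOWN_FIELDS}
        rows.append(row)
    return rows


rows = parse_evidence(SAMPLE_CSV)
print(rows[0]["rank"])  # 1
print(rows[0]["raw"])   # {'captured_on': '2024-05-01'}
```

Keeping unconsumed columns in `raw` means a curator can add their own bookkeeping columns without losing them on import.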

output/agent_context_pack.json

Generated by site-context-pipeline build-context-pack --write
Source data/content_inventory.json, data/internal_link_graph.json, optional data/keyword_metrics.json, optional data/search_performance.json, optional data/search_evidence.json, optional input/project.md
Read by downstream LLM workflows, human review
Optional? This is the primary output of the toolkit

Machine-readable digest with a stable schema, versioned by the schema_version integer. It is designed to be the single document an LLM or a human reviewer reads before drafting a brief or a content change. Top-level keys:

| Key | Description |
| --- | --- |
| schema_version | integer; bumped only when the shape changes |
| generated_at | ISO-8601 UTC timestamp |
| client | client identifier from the CLI |
| summary | counts: pages, edges, nodes, keyword rows, performance rows |
| classification.reasons | counts per classification reason |
| pages | pages grouped by page_type (home, landing, service, category, blog) |
| opportunities | derived lists (see below) |
| search_performance_summary | totals + impression-weighted average position + average CTR |
| providers.{keyword_metrics,search_performance} | provenance: which adapter produced the data |
| project_notes | verbatim contents of input/project.md |
| sources | absolute paths of the files this pack was built from |
| warnings | machine-readable token list (e.g. missing_keyword_data:...) |
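The impression-weighted average position in search_performance_summary differs from a plain mean: rows with more impressions pull the average toward their position. A minimal sketch, assuming rows carry `impressions`, `clicks`, and `position`; the function name, the output key names, and the choice to compute average CTR as total clicks over total impressions are all assumptions, not the pack's confirmed behavior.

```python
def summarize_performance(rows):
    """Totals plus impression-weighted average position and overall CTR."""
    impressions = sum(row["impressions"] for row in rows)
    clicks = sum(row["clicks"] for row in rows)
    if impressions == 0:
        # Nothing to weight by; avoid dividing by zero.
        return {"impressions": 0, "clicks": clicks,
                "avg_position": None, "avg_ctr": None}
    weighted_position = sum(row["position"] * row["impressions"] for row in rows)
    return {
        "impressions": impressions,
        "clicks": clicks,
        "avg_position": weighted_position / impressions,
        "avg_ctr": clicks / impressions,  # assumption: clicks / impressions
    }


rows = [
    {"impressions": 1000, "clicks": 50, "position": 4.0},
    {"impressions": 3000, "clicks": 30, "position": 12.0},
]
summary = summarize_performance(rows)
print(summary["avg_position"])  # 10.0 (a plain mean would give 8.0)
print(summary["avg_ctr"])
```

The second row has three times the impressions of the first, so the weighted position (10.0) sits closer to 12 than the unweighted mean does.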

opportunities carries five lists:

  • commercial_pages_low_blog_inlinks — service/landing/category pages with zero blog inlinks.
  • blog_pages_low_inlinks — blog pages with at most one inlink.
  • top_keywords — keyword rows ranked by demand (or impressions if demand is missing). Empty when no keyword artifact exists.
  • weak_ctr_pages — query/page rows with ≥ 100 impressions and CTR ≤ 2 %. Empty when no performance artifact exists.
  • ranked_but_unsupported — pages with best position ≤ 20 that receive zero inlinks or zero blog inlinks. Empty when no performance artifact exists.
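The two performance-driven lists reduce to simple threshold filters. The sketch below restates the documented thresholds in code; the function names and input shapes are hypothetical, and it assumes `ctr` is stored as a fraction (0.02 for 2 %) rather than a percentage.

```python
def weak_ctr_rows(rows, min_impressions=100, max_ctr=0.02):
    """Rows with at least min_impressions impressions and CTR at or below max_ctr."""
    return [row for row in rows
            if row["impressions"] >= min_impressions and row["ctr"] <= max_ctr]


def ranked_but_unsupported(best_position, nodes, max_position=20):
    """URLs that rank within max_position but have zero inlinks or zero blog inlinks.

    best_position maps url -> best (lowest) position seen for that page.
    nodes is a list of node dicts as in internal_link_graph.json.
    """
    result = []
    for node in nodes:
        position = best_position.get(node["url"])
        if position is None or position > max_position:
            continue
        if node["inlink_count"] == 0 or node["blog_inlink_count"] == 0:
            result.append(node["url"])
    return result


performance = [
    {"query": "local delivery", "impressions": 500, "ctr": 0.01},   # weak CTR
    {"query": "delivery tips", "impressions": 50, "ctr": 0.00},     # too few impressions
    {"query": "same day courier", "impressions": 900, "ctr": 0.08}, # healthy CTR
]
nodes = [
    {"url": "https://example.com/a/", "inlink_count": 3, "blog_inlink_count": 0},
    {"url": "https://example.com/b/", "inlink_count": 2, "blog_inlink_count": 1},
    {"url": "https://example.com/c/", "inlink_count": 0, "blog_inlink_count": 0},
]
best_position = {"https://example.com/a/": 7.0,
                 "https://example.com/b/": 3.0,
                 "https://example.com/c/": 45.0}

weak = weak_ctr_rows(performance)
unsupported = ranked_but_unsupported(best_position, nodes)
print([row["query"] for row in weak])  # ['local delivery']
print(unsupported)                     # ['https://example.com/a/']
```

Page `c` has no inlinks at all but is excluded because its best position is beyond 20; page `a` ranks well and is flagged only because no blog page links to it.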

output/agent_context_pack.md

Generated by site-context-pipeline build-context-pack --write
Source the same data as the JSON pack, rendered as Markdown
Read by humans, code review, LLM prompts
Optional? Always written alongside the JSON pack

Human-readable mirror of the JSON pack. Sections appear conditionally: the Top keyword opportunities, Pages with impressions but weak CTR, Pages with rankings but weak internal support, and Search performance summary sections are emitted only when their underlying artifacts exist. When an artifact is missing, a Missing keyword data line points the reader at the import-* commands instead.

output/content_opportunities.md

Generated by site-context-pipeline build-context-pack --write
Source output/agent_context_pack.json (same in-memory build)
Read by humans (review checklist)
Optional? Always written alongside the pack

Deterministic shortlist of gaps: orphan blog posts, commercial pages with no blog support, weak-CTR queries, ranked-but-unsupported pages. Designed to be a human review prompt rather than a ranking.

Required vs optional, summarised

| Artifact | Required for the pipeline to run? |
| --- | --- |
| data/content_inventory.json | yes; every other step reads it |
| data/internal_link_graph.json | optional; pack still builds, with fewer signals |
| data/keyword_metrics.json | optional; adds keyword opportunities to the pack |
| data/search_performance.json | optional; adds performance signals to the pack |
| data/search_evidence.json | optional; adds the "What competitors do" section to the pack |
| output/agent_context_pack.json | always written by build-context-pack |
| output/agent_context_pack.md | always written by build-context-pack |
| output/content_opportunities.md | always written by build-context-pack |
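The summary above translates into a small pre-flight check that a wrapper script could run before calling build-context-pack. This is a sketch, not a toolkit feature: `artifact_status` and `missing_required` are hypothetical helpers, and the demo uses a temporary directory in place of a real `clients/<client>/` workspace.

```python
import tempfile
from pathlib import Path

# Paths relative to clients/<client>/, taken from the summary above.
REQUIRED = ("data/content_inventory.json",)
OPTIONAL = (
    "data/internal_link_graph.json",
    "data/keyword_metrics.json",
    "data/search_performance.json",
    "data/search_evidence.json",
)


def artifact_status(workspace):
    """Map each known data artifact to True or False depending on whether it exists."""
    root = Path(workspace)
    return {rel: (root / rel).is_file() for rel in REQUIRED + OPTIONAL}


def missing_required(workspace):
    """List required artifacts that are absent from the workspace."""
    status = artifact_status(workspace)
    return [rel for rel in REQUIRED if not status[rel]]


# Demo against a throwaway workspace so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "clients" / "demo"
    (workspace / "data").mkdir(parents=True)
    before = missing_required(workspace)
    (workspace / "data" / "content_inventory.json").write_text("[]", encoding="utf-8")
    after = missing_required(workspace)
    status = artifact_status(workspace)

print(before)  # ['data/content_inventory.json']
print(after)   # []
```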