How this compares
site-context-pipeline is a small piece of a larger toolchain. It does
not replace a crawler, an SEO suite, or a CMS. It complements them by
answering one specific question: what does an LLM (or a human writer)
need to read about this site before drafting anything?
This page is an honest, opinionated guide to where the toolkit fits relative to the tools you probably already use. Where the toolkit genuinely does something better, this page says so. Where a commercial tool is the right answer, it says that too.
Bias disclosure. This is the project's own docs page; treat it the way you'd treat any vendor comparison. The matrix below is based on the public capabilities of each tool at the time of writing; details change. If you find an inaccuracy, please open an issue.
At a glance
| Capability | site-context-pipeline | Screaming Frog | Sitebulb | ContentKing | Ahrefs / Semrush | Custom Python script |
|---|---|---|---|---|---|---|
| Crawls a live site | no — bring a CSV/sitemap | yes | yes | yes (continuous) | yes | yes (you write it) |
| Reads sitemap.xml | yes (offline) | yes | yes | yes | yes | yes |
| Reads SF exports | yes (`internal_*.csv`, `*_inlinks.csv`) | n/a (it is SF) | partial | no | no | yes (you write it) |
| Internal link graph | yes — joined with inventory | yes | yes (richer) | yes | basic | yes |
| Page-type classification | yes — configurable rules with priority/exclude/allow | basic via filters | yes (segments) | yes | no | yes |
| Keyword volume integration | local CSV, vendor-neutral | no | partial | no | yes (proprietary) | yes |
| Search Console integration | local GSC CSV today; live API on roadmap | no | partial | no | indirect | yes |
| Search-evidence (SERP) rows | local hand-curated CSV | no | no | no | yes (proprietary) | yes |
| Deterministic Markdown QA | yes (9 checks, exit-coded) | no | no | no | no | partial |
| Single-artifact LLM context pack | yes (JSON + Markdown) | no | no | no | no | yes (you write it) |
| Public JSON Schemas | yes (six schemas, in-wheel) | no | no | no | no | n/a |
| Offline / air-gapped CI | yes — base install zero deps | no (desktop app) | no | no (cloud) | no (cloud) | maybe |
| Open source | yes (MIT) | no | no | no | no | yes (yours) |
| Cost | free | paid (license) | paid (subscription) | paid (subscription) | paid (subscription) | engineering time |
Where this toolkit is better
For everything below, "better" means better at this specific job — a crawl-heavy job is still a job for Screaming Frog.
- Producing a stable digest for an LLM. No commercial SEO suite publishes a structured, machine-readable, deterministic digest of a site that a model can consume without scraping screenshots of dashboards. That gap is the reason this project exists.
- Reproducible pipelines. Same input → same output, byte-for-byte. CI can compare today's pack against last quarter's via a normal Git diff. Cloud SEO tools can report different numbers from one page load to the next, so their output cannot be diffed this way.
- Vendor-neutrality and air-gapped use. Zero runtime dependencies in the base install means it runs on a locked-down CI runner that has no network access. Cloud SEO tools can't.
- Putting human review first. The output is designed to be read by a person before any LLM gets to it. The Markdown twin of the pack is human-friendly; the JSON twin is automation-friendly. Both ship out of the same command.
- Custom classification rules. Configurable rules with priority, exclude patterns, and allow-lists handle messy IAs (commercial pages mixed with content; localised slug variants; legacy URLs). Out-of-the-box page-type buckets in commercial tools rarely fit.
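The classification bullet above can be illustrated with a minimal sketch. The `Rule` shape, the field names, and the sample patterns below are assumptions made for this example, not the toolkit's actual configuration format; the point is only how priority, exclude patterns, and allow-lists interact.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule shape; the toolkit's real config may differ."""
    page_type: str
    priority: int                                      # higher priority is tried first
    pattern: str                                       # regex matched against the URL path
    exclude: list[str] = field(default_factory=list)   # regexes that veto a match
    allow: list[str] = field(default_factory=list)     # exact paths that always match


def classify(path: str, rules: list[Rule], default: str = "other") -> str:
    """Return the page type of the highest-priority rule that accepts the path."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if path in rule.allow:
            return rule.page_type
        if any(re.search(p, path) for p in rule.exclude):
            continue
        if re.search(rule.pattern, path):
            return rule.page_type
    return default


# Sample rules (invented): commercial pages mixed with content,
# localised slug variants, and legacy URLs.
RULES = [
    Rule("product", 20, r"^/(?:[a-z]{2}/)?shop/", exclude=[r"/shop/guides/"]),
    Rule("article", 10, r"^/(?:[a-z]{2}/)?(?:blog|shop/guides)/"),
    Rule("legacy", 5, r"\.aspx$", allow=["/old-home"]),
]

for path in ["/shop/widgets/blue", "/de/shop/guides/sizing", "/old-home", "/about"]:
    print(path, "->", classify(path, RULES))
```

In this sketch the exclude pattern keeps buying guides out of the product bucket even though they live under the shop path, and the allow-list catches a legacy URL that no pattern would match.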
Where this toolkit is not the right answer
- You need to crawl a site. Use Screaming Frog, Sitebulb, or Ahrefs Site Audit. This toolkit does not have a crawler in 0.x and has no plans to grow one — it integrates with the output of yours.
- You need a managed SEO platform with dashboards, alerting, and proprietary keyword data. Use Ahrefs, Semrush, or Sitebulb. They cost money for a reason.
- You want continuous monitoring. ContentKing exists for this and is good at it.
- You want a "write me a 1500-word article" button. Out of scope. Forever.
- Your search engine is Yandex (or Baidu, or Naver) only. The core is vendor-neutral and the local CSV providers don't care, so the toolkit works — but you'll bring the data yourself until a community provider for your engine of choice lands.
Common combinations
Most teams using this toolkit also use one or more of:
- Screaming Frog → site-context-pipeline. The most common pairing. Run SF for a fresh crawl, hand its `internal_html.csv` and `all_inlinks.csv` to `build-inventory --format screaming-frog` and `build-link-graph --format screaming-frog`. The toolkit picks up exactly where SF stops being the right tool for the job.
- Google Search Console export → site-context-pipeline. Performance → Export → run `import-search-performance --provider local-gsc-csv`. The Markdown pack then has a "weak CTR pages" section your editor can act on.
- Ahrefs / Semrush keyword export → site-context-pipeline. Both vendors export to CSV; the local-csv provider reads either format with header normalisation, so an Ahrefs `Keyword Difficulty` CSV and a Semrush `Search Volume` CSV produce the same artifact shape.
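Header normalisation, as mentioned for the keyword exports above, can be sketched roughly as follows. The alias table, the column names, and the two sample CSVs are assumptions for illustration only; real vendor exports and the toolkit's own mapping may differ.

```python
import csv
import io

# Assumed header aliases; real vendor exports may use different column names.
HEADER_ALIASES = {
    "keyword": "keyword",
    "volume": "volume",
    "search volume": "volume",
    "difficulty": "difficulty",
    "keyword difficulty": "difficulty",
    "kd": "difficulty",
}


def normalise_rows(csv_text: str) -> list[dict[str, str]]:
    """Map vendor-specific headers onto canonical keys and drop unknown columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {}
        for header, value in raw.items():
            if header is None or value is None:
                continue  # skip ragged rows with extra or missing cells
            key = HEADER_ALIASES.get(header.strip().lower())
            if key is not None:
                row[key] = value.strip()
        rows.append(row)
    return rows


# Two invented exports with different headers describing the same keyword.
vendor_a = "Keyword,Volume,KD\nblue widgets,1200,34\n"
vendor_b = "Keyword,Search Volume,Keyword Difficulty,Trend\nblue widgets,1200,34,up\n"

print(normalise_rows(vendor_a) == normalise_rows(vendor_b))
```

Both inputs reduce to rows with the same three keys, which is what lets later pipeline steps ignore which vendor the file came from.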
Why not just write a script?
A custom Python script is the obvious alternative for a one-off audit. The toolkit beats a script when:
- More than one person reviews the output. A shared, documented, deterministic format means everyone reads the same digest.
- The site is audited more than once. Quarterly audits reuse the same pack format; the deltas across time become meaningful.
- The output goes into an LLM step. Without a stable schema and citation map, agents drift. The pack's `sources` field gives every fact a traceable origin.
- You want to plug in vendor APIs later. The provider abstraction means today's `local-csv` import becomes tomorrow's live API call without touching the rest of the pipeline.
If none of those apply, write the script. The toolkit isn't trying to take work away from anyone.
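The provider argument in the list above can be sketched as a small interface. Every name here (`KeywordProvider`, `LocalCsvProvider`, `StubApiProvider`, `build_keyword_section`) is hypothetical and chosen for this example; the toolkit's real provider contract is described on the Providers page and may look different.

```python
import csv
import io
from typing import Iterable, Protocol


class KeywordProvider(Protocol):
    """Hypothetical interface; the toolkit's real provider contract may differ."""

    def fetch(self) -> Iterable[dict[str, str]]: ...


class LocalCsvProvider:
    """Reads keyword rows from CSV text that was exported by hand."""

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def fetch(self) -> Iterable[dict[str, str]]:
        return list(csv.DictReader(io.StringIO(self._csv_text)))


class StubApiProvider:
    """Stands in for a future live API client; returns canned rows, no network."""

    def __init__(self, rows: list[dict[str, str]]) -> None:
        self._rows = rows

    def fetch(self) -> Iterable[dict[str, str]]:
        return list(self._rows)


def build_keyword_section(provider: KeywordProvider) -> list[str]:
    """Downstream step: depends only on fetch(), never on where rows came from."""
    rows = sorted(provider.fetch(), key=lambda r: r["keyword"])
    return [f"{row['keyword']}: {row['volume']}" for row in rows]


local = LocalCsvProvider("keyword,volume\nred widgets,300\nblue widgets,1200\n")
stub = StubApiProvider([
    {"keyword": "blue widgets", "volume": "1200"},
    {"keyword": "red widgets", "volume": "300"},
])
print(build_keyword_section(local) == build_keyword_section(stub))
```

Because the downstream function only calls `fetch()`, swapping the CSV reader for an API client changes one constructor call and nothing else. A one-off script rarely needs that seam, which is consistent with the advice to just write the script when none of the conditions apply.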
See also
- Architecture — why the pipeline is one-way and works only on local files.
- Providers — the four safety rules that keep vendor-specific code optional.
- Recipes — concrete workflows that use this toolkit alongside the tools listed above.