Contributing to site-context-pipeline
Thanks for your interest. This project values small, well-scoped contributions that keep the toolkit honest: a structured site-context pipeline, not an auto-publish SEO bot.
Ground rules
- No client data. Pull requests must not include real domains, real keyword lists, real briefs, scraped HTML, API keys, or business identifiers. Use the synthetic demo client (examples/demo-client/) for fixtures.
- No silent network calls. The 0.1 core is standard-library only and stays offline. Network adapters (Wordstat, SERP, LLMs, image APIs) belong behind explicit opt-in flags and optional extras, in separate modules with clear interfaces.
- Tests for new behavior. If you add a classifier rule, an artifact field, or a CLI flag, add a unit test under tests/. Tests must run with no internet access.
- Backwards-compatible JSON schemas. data/*.json and output/agent_context_pack.json are public contracts. Add fields, do not rename or remove them. Bump schema_version when shape changes.
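As a rough illustration of the "add fields, do not rename or remove them" rule, the sketch below compares two versions of a JSON document and reports any key that disappeared. It is not part of the toolkit: the function name removed_paths and the pages, url, title, and cluster fields are hypothetical, and it assumes schema_version is an integer.

```python
from typing import Any


def removed_paths(old: Any, new: Any, prefix: str = "") -> list[str]:
    """Return dotted paths of keys present in `old` but missing from `new`."""
    missing: list[str] = []
    if isinstance(old, dict):
        if not isinstance(new, dict):
            return [prefix or "<root>"]
        for key, old_value in old.items():
            path = f"{prefix}.{key}" if prefix else key
            if key not in new:
                missing.append(path)  # renamed or removed field
            else:
                missing.extend(removed_paths(old_value, new[key], path))
    elif isinstance(old, list) and old:
        if not isinstance(new, list):
            return [prefix or "<root>"]
        if new:
            # Compare the first element only: enough for homogeneous lists.
            missing.extend(removed_paths(old[0], new[0], f"{prefix}[]"))
    return missing


# Hypothetical artifact shapes, for illustration only.
old_pack = {"schema_version": 1, "pages": [{"url": "https://example.test/", "title": "Home"}]}
new_pack = {
    "schema_version": 2,
    "pages": [{"url": "https://example.test/", "title": "Home", "cluster": "demo"}],
}
renamed_pack = {"schema_version": 2, "pages": [{"href": "https://example.test/", "title": "Home"}]}

print(removed_paths(old_pack, new_pack))      # additive change: nothing removed
print(removed_paths(old_pack, renamed_pack))  # rename shows up as a removal
```

A check like this could sit in a unit test under tests/ and run fully offline, since it needs only the standard library.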
Local setup
python -m venv .venv
. .venv/Scripts/activate # Windows
# . .venv/bin/activate # macOS / Linux
pip install -e ".[dev]"
Running checks
CI runs the same commands on Python 3.11 and 3.12, plus a coverage
upload to Codecov from the 3.12 job. Mypy runs in --strict mode
against src/site_context_pipeline/ per the configuration in
pyproject.toml.
Pre-commit (optional but recommended)
The repo ships a .pre-commit-config.yaml that runs ruff check --fix,
ruff format, and a few file-hygiene hooks (trailing whitespace, EOL,
YAML/TOML syntax). Once installed, every git commit runs them on the
staged files so you do not push CI-failing diffs.
pip install pre-commit
pre-commit install # one-time hook install
pre-commit run --all-files # optional sweep before opening a PR
Performance benchmark
A budgeted performance test (tests/test_perf_benchmark.py) runs on
every CI invocation at a small synthetic scale (2,000 URLs) to catch
algorithmic regressions — an accidental O(n²) join would blow the
wall-clock budget on any runner. Scale it up locally with an env var:
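The idea behind a budgeted test can be sketched as follows. This is an illustration only, not the contents of tests/test_perf_benchmark.py: the env var names PERF_URL_COUNT and PERF_BUDGET_SECONDS, the 5-second default budget, and the helper functions are all hypothetical.

```python
import os
import time

# Hypothetical knobs; the project's real variable names may differ.
URL_COUNT = int(os.environ.get("PERF_URL_COUNT", "2000"))
BUDGET_SECONDS = float(os.environ.get("PERF_BUDGET_SECONDS", "5.0"))


def make_urls(n):
    """Generate n synthetic URLs on a reserved test domain."""
    return [f"https://example.test/page-{i}" for i in range(n)]


def join_by_url(urls, rows):
    """Hash join: one pass to build the index, one pass to look up."""
    index = {row["url"]: row for row in rows}
    return [(url, index[url]) for url in urls if url in index]


def test_join_stays_within_budget():
    urls = make_urls(URL_COUNT)
    rows = [{"url": u, "clicks": i} for i, u in enumerate(urls)]

    start = time.perf_counter()
    joined = join_by_url(urls, rows)
    elapsed = time.perf_counter() - start

    assert len(joined) == URL_COUNT
    assert elapsed < BUDGET_SECONDS, f"{elapsed:.3f}s exceeds {BUDGET_SECONDS}s budget"
```

The budget is deliberately generous: a linear join finishes far inside it on any runner, while a nested-loop join grows quadratically as the URL count is scaled up and eventually exceeds it.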
For ad-hoc profiling at realistic scale, the standalone
scripts/perf_benchmark.py generates a synthetic site and prints
per-stage timings:
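Per-stage timing of this kind can be collected with a small context manager. The sketch below is an assumption about the general approach, not a copy of scripts/perf_benchmark.py; the stage names generate, classify, and pack are hypothetical.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage, in first-seen order.
timings: dict[str, float] = {}


@contextmanager
def stage(name):
    """Time the enclosed block and add the elapsed seconds to `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def run_demo(n_urls=1000):
    """Run three toy stages over a synthetic site and return the result."""
    with stage("generate"):
        urls = [f"https://example.test/p/{i}" for i in range(n_urls)]
    with stage("classify"):
        labels = ["blog" if i % 2 else "product" for i, _ in enumerate(urls)]
    with stage("pack"):
        pack = list(zip(urls, labels))
    return pack


if __name__ == "__main__":
    run_demo()
    for name, seconds in timings.items():
        print(f"{name:<10} {seconds * 1000:8.2f} ms")
```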
Filing issues
When reporting a bug, please include:
- Python version and OS.
- The exact command you ran.
- The shape of your input data (URL count, columns present). Do not paste real client data — synthesize a minimal failing example.
- The relevant snippet of the JSON output, with sensitive values redacted.
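One possible way to redact a JSON snippet before pasting it into an issue is sketched below. The set of sensitive key names and the sample snippet are hypothetical; adjust the keys to whatever your output actually contains, and review the result by eye before posting.

```python
import json

# Hypothetical key names; extend to match the fields in your own output.
SENSITIVE_KEYS = {"url", "domain", "keyword", "api_key"}


def redact(value, sensitive=SENSITIVE_KEYS):
    """Return a copy of `value` with sensitive keys replaced by a placeholder."""
    if isinstance(value, dict):
        return {
            key: "<redacted>" if key in sensitive else redact(item, sensitive)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive) for item in value]
    return value


# Synthetic snippet for illustration only.
snippet = {"schema_version": 1, "pages": [{"url": "https://client.example/", "depth": 2}]}
print(json.dumps(redact(snippet), indent=2))
```

Redacting by key keeps the structure of the output intact, which is usually what a maintainer needs to reproduce the problem.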
Pull requests
- Keep the diff focused. One feature or fix per PR.
- Update README.md if you add or change a CLI command or artifact field.
- If a change deserves a roadmap item, edit the Roadmap section in README.md rather than adding new top-level documents.