Classifier rules¶

The inventory builder classifies every URL into one of six page types: home, service, blog, category, landing, other. The result lands in data/content_inventory.json and drives downstream artefacts (link graph opportunities, the context pack's "pages by type" section, etc.).

The toolkit ships a built-in rule set that handles common URL shapes (/blog/, /services/, /category/, /pricing, …). When your site uses a different convention, drop a clients/<id>/config/classifier.json into the workspace.

Resolution order¶

For each URL, the classifier checks in this order:

1. Explicit commercial URL list: the URL appears in config/commercial_urls.json. The page type becomes landing; the reason becomes matched_commercial_url_list.
2. Home page: the URL's path is / or empty. The page type becomes home; the reason becomes matched_home_path.
3. Per-rule allow lists: any rule with the URL in its allow_urls fires regardless of its pattern. The reason becomes matched_allow_url:<page_type>.
4. Pattern rules: evaluated in priority order (lowest number first; ties broken by list order). If one of a rule's exclude_patterns matches, that rule is skipped and the next rule gets a chance. The reason becomes matched_pattern:<pattern>.
5. Fallback: nothing matched. The page type becomes other; the reason becomes fallback_other.
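The resolution order above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the toolkit's actual code: the function name `classify`, exact string comparison of URLs, walking allow lists in file order, and the use of `fnmatch.fnmatchcase` for globbing are all choices made for this sketch (`fnmatchcase` also treats `?` and `[...]` as special, which the toolkit's matcher may not).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

DEFAULT_PRIORITY = 100  # documented default for rules without "priority"


def classify(url, rules, commercial_urls=()):
    """Return (page_type, reason) following the five-step resolution order."""
    # Home and pattern checks look at the lower-cased path only.
    path = urlsplit(url).path.lower()

    # 1. Explicit commercial URL list (assumed: a collection of URL strings).
    if url in commercial_urls:
        return "landing", "matched_commercial_url_list"

    # 2. Home page: the path is "/" or empty.
    if path in ("", "/"):
        return "home", "matched_home_path"

    # 3. Per-rule allow lists fire regardless of the rule's pattern.
    #    Walking the rules in file order here is an assumption.
    for rule in rules:
        if url in rule.get("allow_urls", []):
            return rule["page_type"], f"matched_allow_url:{rule['page_type']}"

    # 4. Pattern rules: lowest priority number first, ties by list position.
    ordered = sorted(
        enumerate(rules),
        key=lambda pair: (pair[1].get("priority", DEFAULT_PRIORITY), pair[0]),
    )
    for _, rule in ordered:
        if any(fnmatchcase(path, ex) for ex in rule.get("exclude_patterns", [])):
            continue  # excluded by this rule; the next rule gets a chance
        if fnmatchcase(path, rule["pattern"]):
            return rule["page_type"], f"matched_pattern:{rule['pattern']}"

    # 5. Fallback.
    return "other", "fallback_other"


# Illustrative rules only; the URLs are placeholders.
RULES = [
    {"page_type": "blog", "pattern": "*/blog/*", "priority": 10,
     "exclude_patterns": ["*/blog/archive/*"]},
    {"page_type": "category", "pattern": "*/blog/archive/*", "priority": 20},
    {"page_type": "service", "pattern": "*/services/*", "priority": 30,
     "allow_urls": ["https://example.com/special-bundle/"]},
]

if __name__ == "__main__":
    for sample in ("https://example.com/",
                   "https://example.com/blog/archive/2019/",
                   "https://example.com/special-bundle/"):
        print(sample, classify(sample, RULES))
```

With these sample rules, the archive URL falls through the blog rule's exclusion and is picked up by the category rule, while the allow-listed URL is classified before any pattern is consulted.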

The reason string is part of the public artefact contract — write audits and dashboards against it.
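As an example of auditing against the reason string, the sketch below tallies pages by reason kind. It assumes each inventory record is a dict with a `reason` key; the real layout of data/content_inventory.json may differ, and the helper names and sample records are hypothetical.

```python
from collections import Counter


def reason_kind(reason):
    """Split 'matched_pattern:*/blog/*' into ('matched_pattern', '*/blog/*')."""
    kind, _, detail = reason.partition(":")
    return kind, detail


def summarize_reasons(pages):
    """Count pages per reason kind; `pages` is an iterable of dicts with a 'reason' key."""
    return Counter(reason_kind(page["reason"])[0] for page in pages)


# Hypothetical records, shaped only for this illustration.
sample_pages = [
    {"url": "https://example.com/", "page_type": "home",
     "reason": "matched_home_path"},
    {"url": "https://example.com/blog/a/", "page_type": "blog",
     "reason": "matched_pattern:*/blog/*"},
    {"url": "https://example.com/blog/b/", "page_type": "blog",
     "reason": "matched_pattern:*/blog/*"},
    {"url": "https://example.com/about/", "page_type": "other",
     "reason": "fallback_other"},
]
summary = summarize_reasons(sample_pages)

if __name__ == "__main__":
    for kind, count in summary.most_common():
        print(kind, count)
```

A rising `fallback_other` count in such a tally is a quick signal that the rule set no longer covers the site's URL conventions.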

Schema¶

Minimal (legacy, still supported)¶

{
  "rules": [
    { "page_type": "blog",    "pattern": "*/blog/*"     },
    { "page_type": "service", "pattern": "*/services/*" },
    { "page_type": "landing", "pattern": "*/pricing*"   }
  ]
}

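First match wins: every rule gets the default priority of 100 and ties are broken by list order, so rules are tried from top to bottom. This is identical to how the toolkit behaved before the schema was extended.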

Extended¶

Every rule carries the same two required fields and may add up to three optional fields:

Field	Type	Default	Notes
`page_type`	string	required	One of `home`, `service`, `blog`, `category`, `landing`, `other`. A rule with any other value is skipped with a warning.
`pattern`	string	required	Glob with `*` wildcards. Matched against the URL's path (lower-cased).
`priority`	int	`100`	Lower wins. Ties broken by the rule's position in the JSON list, so the order in your file still matters.
`exclude_patterns`	list of glob strings	`[]`	If any of these matches, the rule is skipped and the next rule (in priority order) gets a chance.
`allow_urls`	list of full URLs	`[]`	If the URL is on this list, the rule fires unconditionally — the pattern is ignored. Useful for one-off promotions of a specific URL.

{
  "rules": [
    {
      "page_type": "blog",
      "pattern": "*/blog/*",
      "priority": 10,
      "exclude_patterns": ["*/blog/archive/*"]
    },
    {
      "page_type": "category",
      "pattern": "*/blog/archive/*",
      "priority": 20
    },
    {
      "page_type": "service",
      "pattern": "*/services/*",
      "priority": 30,
      "allow_urls": [
        "https://example.com/special-bundle/"
      ]
    },
    {
      "page_type": "landing",
      "pattern": "*/pricing*",
      "priority": 40
    }
  ]
}

Reading this:

All /blog/... URLs are blogs (priority 10), except those under /blog/archive/ — those are caught by the next rule and become categories.
All /services/... URLs are services. So is the one-off /special-bundle/ URL, even though it does not look like a services URL — it is allow-listed.
All /pricing* URLs are landings.

How invalid input is handled¶

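The toolkit keeps going rather than failing on a bad classifier file. Each invalid rule is skipped, a file that cannot be used at all falls back to the built-in defaults, and every problem is recorded in the inventory's warnings list: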

Symptom	Warning token
Rule entry is not a JSON object	`classifier_rule_not_object:index=N`
Rule missing `page_type` or `pattern`	`classifier_rule_missing_fields:index=N`
Rule has an unknown `page_type`	`classifier_rule_invalid_page_type:index=N,value=X`
Rule has a non-int `priority`	`classifier_rule_invalid_priority:index=N`
Rule has a non-list `exclude_patterns`	`classifier_rule_invalid_exclude_patterns:index=N`
Rule has a non-list `allow_urls`	`classifier_rule_invalid_allow_urls:index=N`
Whole file is not valid JSON	`classifier_json_invalid` (built-in defaults used)
File is JSON but `rules` is empty	`classifier_json_empty_using_defaults`

These warnings appear inside the inventory's standard CLI payload so they are visible in CI logs and easy to grep.
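The skip-and-warn behaviour in the table can be sketched as follows. The token names come from the table; everything else is an assumption made for illustration: the function names, zero-based `index` values, the order in which checks run, rejecting booleans as priorities, and treating a missing or non-list `rules` key the same as an empty one.

```python
import json

VALID_PAGE_TYPES = {"home", "service", "blog", "category", "landing", "other"}


def validate_rules(raw_rules):
    """Return (valid_rules, warnings), skipping each invalid rule."""
    valid, warnings = [], []
    for index, rule in enumerate(raw_rules):
        if not isinstance(rule, dict):
            warnings.append(f"classifier_rule_not_object:index={index}")
            continue
        if "page_type" not in rule or "pattern" not in rule:
            warnings.append(f"classifier_rule_missing_fields:index={index}")
            continue
        page_type = rule["page_type"]
        if not isinstance(page_type, str) or page_type not in VALID_PAGE_TYPES:
            warnings.append(
                f"classifier_rule_invalid_page_type:index={index},value={page_type}"
            )
            continue
        priority = rule.get("priority", 100)
        # bool is a subclass of int in Python; rejecting it is an assumption.
        if isinstance(priority, bool) or not isinstance(priority, int):
            warnings.append(f"classifier_rule_invalid_priority:index={index}")
            continue
        if not isinstance(rule.get("exclude_patterns", []), list):
            warnings.append(f"classifier_rule_invalid_exclude_patterns:index={index}")
            continue
        if not isinstance(rule.get("allow_urls", []), list):
            warnings.append(f"classifier_rule_invalid_allow_urls:index={index}")
            continue
        valid.append(rule)
    return valid, warnings


def load_rules(text, defaults):
    """Parse classifier JSON text; fall back to `defaults` on file-level problems."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return defaults, ["classifier_json_invalid"]
    rules = data.get("rules") if isinstance(data, dict) else None
    if not isinstance(rules, list) or not rules:
        return defaults, ["classifier_json_empty_using_defaults"]
    return validate_rules(rules)
```

Because the tokens share the `classifier_` prefix, a CI step could, for instance, fail the build whenever any warning in the payload starts with that prefix.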

When to use the extended schema¶

Negation (exclude_patterns): when a single broad pattern almost works but one sub-tree needs different treatment. Example: blog posts are at /blog/*, but legacy archives at /blog/archive/* should be treated as categories.
Forced matches (allow_urls): when a single URL does not fit any pattern but you know it belongs to a specific type. Example: a high-priority landing page that sits directly under the site root instead of in a dedicated sub-directory.
Explicit priority: when you mix narrow and broad rules and want the narrow ones to win regardless of file order. Example: a more specific /services/local-delivery/* rule with priority 10 that beats a generic /services/* rule with priority 50.
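To make the explicit-priority case concrete, here is a small sketch of how an evaluation order could be derived from `priority` plus file position. The function name `evaluation_order` and the page types given to the sample rules are hypothetical; only the sort key (priority first, then list position, default 100) comes from the schema described above.

```python
def evaluation_order(rules, default_priority=100):
    """Return rules sorted by priority, with ties broken by file position."""
    indexed = list(enumerate(rules))
    indexed.sort(key=lambda pair: (pair[1].get("priority", default_priority), pair[0]))
    return [rule for _, rule in indexed]


# The broad rule comes first in the file, yet the narrow one is tried first.
rules = [
    {"page_type": "service", "pattern": "*/services/*", "priority": 50},
    {"page_type": "landing", "pattern": "*/services/local-delivery/*", "priority": 10},
    {"page_type": "blog", "pattern": "*/blog/*"},  # no priority: treated as 100
]
ordered_patterns = [rule["pattern"] for rule in evaluation_order(rules)]

if __name__ == "__main__":
    print(ordered_patterns)
```

Without the explicit priorities, both service rules would default to 100 and the broad /services/* rule would win simply because it is listed first.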

If none of these apply, the legacy flat schema is fine — keep it.