Chunk Bus contract
The Chunk Bus is the canonical contract for document chunks. It is independent of chat events and sessions.
Scope and non-scope
This page defines:
- The canonical chunk object schema
- Canonicalization expectations and invariants
- Processed files idempotency rules
- File layout and manifest rules
- Storage boundary rules for caches and vector stores
- Smoke test expectations
This page does not define:
- How to parse a specific upstream document format (PDF, TEI, HTML, Markdown)
- How to run a specific parser implementation
- Retrieval UX, ranking, or prompt design
- Any event bus or session bus schemas
Explicit separation from the Event Bus
Chunk Bus is for documents and their chunked representations.
Event Bus is for atomic message events.
Rules:
- Chunk Bus producers must not emit chat message events
- Event Bus consumers must not treat chunks as events
- If you want to bridge families, do it with an adapter that emits a bus-compliant artifact, and declare the seam on the Integration seams page
Canonical chunk schema
A chunk is a self-contained unit of document text plus enough metadata to support replay, indexing, and traceability.
Required fields
- `schema_version` - String marker, for example `chunks.v1`
- `chunk_id` - Stable identifier for the chunk; must be deterministic from canonical inputs
- `document_id` - Stable identifier for the source document
- `text` - The chunk text content
- `provenance` - Object describing where this chunk came from and how it was produced
Recommended fields
- `source` - Source reference:
  - `source_uri` or a canonical source reference
  - `source_type` such as `pdf`, `tei`, `html`, `md`
- `span` - Structured location information within the document, see Span model below
- `metadata` - Normalized metadata such as title, authors, year, venue, language
- `canonicalization` - Details about normalization steps and their versions
- `hashes` - Content hashes for traceability, see Hash model below
- `created_at` - ISO timestamp when this chunk object was produced
- `producer` - Producer name and version, plus run id if available
Optional fields
- `tokens` - Token counts if computed
- `quality` - Validation or quality signals, for example `is_empty`, `is_truncated`, `parse_warnings`
- `embedding_ref` - Pointer to an embedding cache entry, never the vector itself
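For illustration, a minimal `chunks.v1` object might look like the sketch below. All identifiers, hashes, and producer names are hypothetical; see Stable ID rules and Hash model for how the real values are derived.

```python
# A hypothetical chunks.v1 object shown as a Python dict.
# Values are illustrative, not normative.
chunk = {
    "schema_version": "chunks.v1",
    "chunk_id": "chunk-3f9c2a...",    # deterministic, see Stable ID rules
    "document_id": "doc-7f3a1b...",
    "text": "Normalized chunk text after canonicalization.",
    "provenance": {
        "source_uri": "file:///corpus/paper.pdf",
        "source_checksum": "sha256:...",
        "parser": {"parser_name": "example-parser", "parser_version": "1.2.0"},
        "canonicalizer": {"canonicalizer_name": "example-canon", "canonicalizer_version": "0.3.1"},
        "inputs": [],
    },
    "span": {"page_start": 4, "page_end": 4},
    "hashes": {"text_hash": "sha256:..."},
    "created_at": "2024-01-15T12:00:00Z",
    "producer": {"name": "example-chunker", "version": "1.2.0", "run_id": "run-0001"},
    "embedding_ref": "emb:example-model@1:sha256:...",  # pointer only, never the vector
}
```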
Span model
Span describes where the chunk lives in the source document.
At least one of these must exist if the source supports it:
- `pages` - `page_start`, `page_end` integers
- `char_range` - `char_start`, `char_end` integers in the canonicalized text stream
- `section` - Section path or heading chain
- `tei` - TEI-specific anchors if applicable
Span must be treated as best-effort metadata. A missing span is allowed when the upstream format does not support it reliably.
Provenance model
Provenance must allow consumers to verify traceability and reproducibility.
Required elements:
- `source_uri` or `source_ref`
- `source_checksum` - Hash of the original source file bytes if available
- `parser` - `parser_name`, `parser_version`
- `canonicalizer` - `canonicalizer_name`, `canonicalizer_version`
- `inputs` - List of upstream artifacts used to generate this chunk, including their checksums when available
Hash model
Hash fields enable idempotency, validation, and rebuild operations.
Recommended:
- `text_hash` - Hash of the chunk `text` after canonicalization
- `chunk_object_hash` - Hash of the full canonical chunk object after field ordering normalization
- `document_object_hash` - Optional, if the producer computes a canonical document metadata object
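A minimal sketch of how a producer might compute these hashes. SHA-256 and sorted-key JSON serialization are assumptions here, not requirements of this contract:

```python
import hashlib
import json

def text_hash(text: str) -> str:
    # Hash the canonicalized chunk text bytes.
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunk_object_hash(chunk: dict) -> str:
    # Normalize field ordering by serializing with sorted keys,
    # then hash the resulting canonical JSON bytes.
    canonical = json.dumps(chunk, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```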
Canonicalization rules
Chunk producers must canonicalize enough to make IDs stable and downstream indexing predictable.
Minimum canonicalization expectations:
- Normalize newlines consistently
- Remove invalid Unicode and control characters that break JSONL readers
- Strip leading and trailing whitespace from `text`
- Enforce that `text` is a string, never null
- Ensure metadata keys are stable and use one naming convention
Canonicalization must not:
- Rewrite meaning or paraphrase content
- Invent metadata that is not sourced
- Drop text silently
If a producer must drop content, it must record the reason and counts in the manifest and run record.
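A minimal sketch of the text-level steps above. The exact control-character policy is a producer choice; this version keeps newlines and tabs and strips other control characters:

```python
import unicodedata

def canonicalize_text(raw: str) -> str:
    # Normalize newlines consistently.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Remove control characters that break JSONL readers, keeping \n and \t.
    text = "".join(
        ch for ch in text
        if ch in ("\n", "\t") or unicodedata.category(ch) != "Cc"
    )
    # Strip leading and trailing whitespace.
    return text.strip()
```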
Stable ID rules
document_id
document_id must be stable across reruns for the same source document.
Recommended derivation:
- Use a deterministic stable id derived from:
  - normalized source URI or stable path
  - source checksum
  - an optional normalized title if the URI is unavailable
chunk_id
chunk_id must be stable across reruns given the same canonicalization and chunking logic.
Recommended derivation inputs:
- `document_id`
- a stable span key if available, otherwise an ordered index within the document
- `text_hash`
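A sketch of one possible derivation for both ids, assuming SHA-256 over the joined inputs. The prefixes, separator, and truncation length are illustrative choices, not part of the contract:

```python
import hashlib

def _stable_id(prefix: str, *parts: str) -> str:
    # Deterministic: the same canonical inputs always yield the same id.
    material = "\x1f".join(parts)
    return prefix + "-" + hashlib.sha256(material.encode("utf-8")).hexdigest()[:24]

def derive_document_id(source_uri: str, source_checksum: str) -> str:
    return _stable_id("doc", source_uri, source_checksum)

def derive_chunk_id(document_id: str, span_key: str, text_hash: str) -> str:
    # span_key may be an ordered index such as "0004" when no stable span exists.
    return _stable_id("chunk", document_id, span_key, text_hash)
```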
If chunking boundaries change, chunk ids may change. That is acceptable only when:
- the schema version or chunking version changes
- the manifest records the change
- the migration note and ADR rules are followed
Processed files idempotency
Chunk Bus producers must be idempotent at the file ingestion level.
Rules:
- A source file must not be reprocessed if it has already been processed with the same source checksum, parser version, and canonicalizer version
- The system must record processed state in a durable store:
  - a SQLite table, a JSONL ledger, or equivalent
- Idempotency state must be queryable and auditable
Required processed file record fields:
- `source_uri` or source path
- `source_checksum`
- `parser_name`, `parser_version`
- `canonicalizer_name`, `canonicalizer_version`
- `processed_at`
- `run_id` or run record pointer
- `status` and `error_type` if failed
Failure handling:
- A failed processing attempt must be recorded, with an error taxonomy code
- Retrying must not erase prior failures; it should append a new attempt record, as in the sketch below
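A sketch of an append-only SQLite ledger satisfying these rules. Table and column names are hypothetical:

```python
import sqlite3

# Hypothetical append-only processed-files ledger; run DDL once at startup.
DDL = """
CREATE TABLE IF NOT EXISTS processed_files (
    source_uri            TEXT NOT NULL,
    source_checksum       TEXT NOT NULL,
    parser_name           TEXT NOT NULL,
    parser_version        TEXT NOT NULL,
    canonicalizer_name    TEXT NOT NULL,
    canonicalizer_version TEXT NOT NULL,
    processed_at          TEXT NOT NULL,
    run_id                TEXT,
    status                TEXT NOT NULL,  -- e.g. 'ok' or 'failed'
    error_type            TEXT            -- error taxonomy code when failed
)
"""

def already_processed(db: sqlite3.Connection, uri: str, checksum: str,
                      parser_version: str, canonicalizer_version: str) -> bool:
    # Skip only when a successful attempt exists for the same checksum,
    # parser version, and canonicalizer version; failures never block retries.
    row = db.execute(
        "SELECT 1 FROM processed_files"
        " WHERE source_uri = ? AND source_checksum = ?"
        " AND parser_version = ? AND canonicalizer_version = ?"
        " AND status = 'ok' LIMIT 1",
        (uri, checksum, parser_version, canonicalizer_version),
    ).fetchone()
    return row is not None
```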
Storage boundary
Chunk Bus is the canonical source of truth for document text units.
Allowed persisted state:
- Canonical chunk JSONL files plus manifests
- Processed files idempotency ledger
- Optional caches that are derived and rebuildable:
  - embedding cache (SQLite)
  - retrieval indexes
  - vector store directories
Not allowed as the only source of truth:
- Vector store as the only store of chunk text
- Any store that cannot be rebuilt from canonical chunks
Embedding cache interaction
Rules:
- Embeddings must be derived from chunk `text` and include a reference to:
  - embedding model name and version
  - embedding parameters if relevant
  - the `text_hash` that the embedding was computed from
- Chunk Bus must not embed vectors into the chunk object
- Chunk objects may include `embedding_ref` pointers only
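A sketch of how an `embedding_ref` pointer might be formed. The key format is hypothetical; what matters is that it names the model, its version, and the `text_hash`:

```python
def embedding_ref(model_name: str, model_version: str, text_hash: str) -> str:
    # The chunk object stores only this pointer; the vector itself
    # lives in the embedding cache, keyed by the same tuple.
    return f"emb:{model_name}@{model_version}:{text_hash}"
```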
Rebuild indexes from chunks
Canonical chunks must be sufficient to rebuild every derived store. This means:
- Given only the canonical chunk JSONL files plus manifests, a system can recreate:
  - vector stores
  - retrieval indexes
  - any embedding caches
  - any secondary views used for browsing or search
If a rebuild cannot happen, then some required information is missing from the chunk schema or manifests.
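A sketch of the rebuild loop this implies. The `index.add` interface is hypothetical; any derived store plugs in the same way:

```python
import json
from pathlib import Path

def iter_chunks(chunks_dir: Path):
    # Stream every canonical chunk; this is the only input a rebuild needs.
    for path in sorted(chunks_dir.glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

def rebuild_index(chunks_dir: Path, index) -> None:
    # 'index' is any derived store (vector store, retrieval index,
    # browsing view) exposing a hypothetical add(chunk_id, text) method.
    for chunk in iter_chunks(chunks_dir):
        index.add(chunk["chunk_id"], chunk["text"])
```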
File endpoints
The Chunk Bus is defined by the canonical chunks directory plus manifests.
Canonical chunks
- `chunks/canonical/YYYY-MM-DD.jsonl` or an equivalent partition scheme
- Alternative allowed partitions:
  - by document source type
  - by collection name
  - by document id prefix
Partition choice must be stable and documented in the repo runbook.
Manifests
- `chunks/manifest/YYYY-MM-DD.manifest.json`
Manifest must exist even if the output is empty for that partition.
Manifest rules
A manifest must include enough information to validate integrity and enable deterministic downstream behavior.
Required manifest fields:
- `schema_version`
- `partition_key` - date or collection name
- `created_at`
- `producer` - name and version
- `counts`:
  - number of documents processed
  - number of chunks emitted
  - number of failures
- `checksums` - checksum of the chunks file or set of files
- `idempotency` - count of files skipped because they were already processed
- `errors` - aggregated error taxonomy counts
Recommended:
- `input_sources` - list of source URIs and their checksums, possibly truncated with a pointer to a detailed ledger
- `canonicalization_versions`
- `chunking_policy_id` or hash
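A sketch of a manifest with hypothetical nested key names; the smoke test below assumes a `counts.chunks_emitted` key:

```python
# Hypothetical manifest content shown as a Python dict.
manifest = {
    "schema_version": "chunks.v1",
    "partition_key": "2024-01-15",
    "created_at": "2024-01-15T12:05:00Z",
    "producer": {"name": "example-chunker", "version": "1.2.0"},
    "counts": {"documents_processed": 12, "chunks_emitted": 340, "failures": 1},
    "checksums": {"chunks_file": "sha256:..."},
    "idempotency": {"skipped_already_processed": 3},
    "errors": {"SCHEMA_INVALID:json_parse": 1},
}
```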
Compatibility rules
Schema evolution rules:
- New optional fields may be added without a version bump if they are backward compatible
- Any change that affects required fields, id derivation logic, canonicalization semantics, or manifest required fields requires:
  - a schema version bump
  - an ADR
  - a migration note
Consumers must:
- accept unknown fields
- reject missing required fields
- validate `schema_version` and fail fast if unsupported
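A minimal consumer-side check implementing these three rules; the supported-version set is illustrative:

```python
SUPPORTED_VERSIONS = {"chunks.v1"}
REQUIRED_FIELDS = {"schema_version", "chunk_id", "document_id", "text", "provenance"}

def validate_chunk(obj: dict) -> None:
    # Fail fast on unsupported schema versions.
    if obj.get("schema_version") not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {obj.get('schema_version')!r}")
    # Reject missing required fields; unknown fields pass through untouched.
    missing = REQUIRED_FIELDS - obj.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
```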
Smoke test
Purpose:
- Validate that canonical chunks can be loaded and indexed without any vector store present
- Validate manifest existence and basic integrity checks
Smoke test must validate:
- Every JSONL line is valid JSON
- Every object has required fields and the correct `schema_version`
- `chunk_id` uniqueness within the partition
- `text` non-null and string typed
- Manifest exists and `counts.chunks_emitted` matches the observed count
- Checksums match expected values if present
Smoke test must not validate:
- Embedding quality
- Retrieval ranking
- Parser completeness
Minimal acceptance criteria:
- Running the smoke test on an empty partition produces a valid manifest and passes
- Running on a non-empty partition validates all required invariants, as in the sketch below
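A minimal smoke test sketch under these assumptions. The manifest layout follows the hypothetical example above, and assertion messages reuse the error taxonomy codes from the failure modes below:

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"schema_version", "chunk_id", "document_id", "text", "provenance"}

def smoke_test(chunks_path: Path, manifest_path: Path) -> None:
    # Manifest must exist even when the partition is empty.
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    seen_ids: set[str] = set()
    count = 0
    if chunks_path.exists():
        for lineno, line in enumerate(chunks_path.read_text(encoding="utf-8").splitlines(), 1):
            obj = json.loads(line)  # raises on SCHEMA_INVALID:json_parse
            missing = REQUIRED_FIELDS - obj.keys()
            assert not missing, f"line {lineno}: SCHEMA_INVALID:required_field_missing {sorted(missing)}"
            assert isinstance(obj["text"], str), f"line {lineno}: text must be a non-null string"
            assert obj["chunk_id"] not in seen_ids, f"line {lineno}: INTEGRITY_VIOLATION:duplicate_ids"
            seen_ids.add(obj["chunk_id"])
            count += 1
    expected = manifest["counts"]["chunks_emitted"]
    assert count == expected, f"INTEGRITY_VIOLATION:manifest_mismatch: {count} != {expected}"
```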
Failure modes
Common failure modes and required responses:
- Missing chunk file for a manifested partition
  - Record error `MISSING_OUTPUT:chunks_file`
- Manifest missing
  - Record error `MISSING_OUTPUT:manifest`
- Malformed JSONL
  - Record error `SCHEMA_INVALID:json_parse`
- Missing required fields
  - Record error `SCHEMA_INVALID:required_field_missing`
- Duplicate chunk ids
  - Record error `INTEGRITY_VIOLATION:duplicate_ids`
- Manifest count mismatch
  - Record error `INTEGRITY_VIOLATION:manifest_mismatch`
- Source checksum missing where required by policy
  - Record error `PROVENANCE_INVALID:missing_source_checksum`
All failures must be emitted as run record entries. Downstream consumers must fail fast when these invariants are violated.