Event Bus contract
Authoritative contract for atomic message events.
Purpose
The Event Bus is the canonical, append-only stream of message events. It is the primary seam that downstream consumers rely on.
This page defines the event object schema, on-disk layout, manifest rules, and invariants. It does not define how to parse ChatGPT exports or any other upstream raw source format.
Scope
This contract defines:
- Canonical event object schema and schema version marker
- Stable ID rules and timestamp normalization rules
- Provenance fields required for replay and traceability
- File layout and endpoint patterns
- Manifest format requirements and integrity checks
- Compatibility and evolution rules for schema changes
- Smoke test expectations
- Failure modes and required responses
Non-scope
This contract does not define:
- How raw sources are discovered, downloaded, or authenticated
- Parsing logic for ChatGPT export formats
- Enrichment logic such as tagging, categorization, or summaries
- Any vector store or database indexing strategy
Bus endpoints
Daily event files
Path pattern:
eventbus/daily/YYYY-MM-DD.jsonl
Rules:
- One JSON object per line
- UTF-8
- Append-only for a given day
- Must exist even if empty
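The rules above can be sketched as a small writer. This is a minimal illustration, not the producer's actual code: the function names and the `root` parameter are hypothetical, and bucketing days in UTC is an assumption, since the contract does not name a time zone.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def daily_path(root: Path, timestamp_ms: int) -> Path:
    """Map an event timestamp to its daily file path (assumes UTC day buckets)."""
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    return root / "eventbus" / "daily" / f"{day}.jsonl"


def append_event(root: Path, event: dict) -> Path:
    """Append one event as a single UTF-8 JSON line."""
    path = daily_path(root, event["timestamp_ms"])
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only adds to the end of the file, so existing lines are never rewritten.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")
    return path


def ensure_day_file(root: Path, day: str) -> Path:
    """Create an empty day file so the path exists even when the day has no events."""
    path = root / "eventbus" / "daily" / f"{day}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch(exist_ok=True)
    return path
```

json.dumps escapes newlines inside strings, so each event stays on one line regardless of message content.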
Daily manifest files
Path pattern:
eventbus/manifest/YYYY-MM-DD.manifest.json
Rules:
- Must exist even if the day file is empty
- Must reference the corresponding daily file path
- Must include schema version, counts, and integrity fields
Canonical event schema
Required fields
All events must include the following fields.
- schema_version: String. Example: event.v1
- event_id: String. Stable ID. Deterministic across re-runs when source data is unchanged.
- timestamp_ms: Integer. Epoch milliseconds.
- role: String enum. Allowed values: user, assistant, system, tool
- content: String. The message text. Empty string is allowed.
- source: Object. Provenance for replay and traceability.
Minimum required keys inside source:
- source_system: String. Example: chatgpt_export
- source_record_id: String. The upstream identifier if available, else a deterministic surrogate.
- source_uri: String. Path or identifier of the raw source file or dataset.
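A presence and type check for the required fields might look like the following. It is a sketch only: validate_event and its message strings are hypothetical names, and the checks cover just the types and the role enum listed above.

```python
ALLOWED_ROLES = {"user", "assistant", "system", "tool"}
REQUIRED_SOURCE_KEYS = ("source_system", "source_record_id", "source_uri")


def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the required fields pass."""
    problems = []
    for key in ("schema_version", "event_id", "content"):
        if not isinstance(event.get(key), str):
            problems.append(f"{key} must be a string")
    ts = event.get("timestamp_ms")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(ts, int) or isinstance(ts, bool):
        problems.append("timestamp_ms must be an integer")
    role = event.get("role")
    if not isinstance(role, str) or role not in ALLOWED_ROLES:
        problems.append("role must be one of user, assistant, system, tool")
    source = event.get("source")
    if not isinstance(source, dict):
        problems.append("source must be an object")
    else:
        for key in REQUIRED_SOURCE_KEYS:
            if not isinstance(source.get(key), str):
                problems.append(f"source.{key} must be a string")
    return problems
```

Fields outside the required set are not inspected, so unknown optional fields pass through untouched.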
Optional fields
Optional fields may appear. Consumers must ignore unknown optional fields.
Common examples:
- conversation_id: String.
- title: String.
- language: String.
- tokens: Object with token counts and model details, if computed.
- enrichment: Object. Must be clearly marked as derived output if present.
Example event object
{
  "schema_version": "event.v1",
  "event_id": "evt_2f6d2c7b9b9d0a4e",
  "timestamp_ms": 1672614871597,
  "role": "assistant",
  "content": "Example content",
  "source": {
    "source_system": "chatgpt_export",
    "source_record_id": "e09cd21a-9637-4b9e-9a68-38ce89d74d6e",
    "source_uri": "raw/chatgpt/conversations.json"
  },
  "conversation_id": "4db5e2c6-07ff-4822-8286-c650a852f03b",
  "title": "Keyboard Shortcuts for Nano Text Editor"
}
Stable ID rules
Goal: event_id must not change across re-runs when the same upstream content is reprocessed.
Requirements:
- Deterministic: derived from stable upstream identifiers when available
- If upstream identifiers are missing, derive from a stable tuple such as:
  - source_system
  - source_uri
  - conversation_id if available
  - timestamp_ms
  - role
  - content hash
- Collision resistant: must be based on a cryptographic hash
Prohibited:
- Random UUID generation at ingest time
- IDs that depend on run timestamp or file ordering
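One way to satisfy these rules is to hash a canonical serialization of the identifying fields. The sketch below makes several assumptions that the contract does not state: SHA-256 as the hash, an evt_ prefix with 16 hex characters (chosen only to match the shape of the example event_id), and source_system plus source_record_id as the basis when an upstream identifier exists. derive_event_id is a hypothetical name.

```python
import hashlib
import json


def derive_event_id(source_system, source_uri, timestamp_ms, role, content,
                    source_record_id=None, conversation_id=None) -> str:
    """Derive a deterministic event_id from upstream data only."""
    if source_record_id is not None:
        # Preferred path: a stable upstream identifier is available.
        basis = ["id", source_system, source_record_id]
    else:
        # Fallback path: the stable tuple, with the content reduced to its hash.
        content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
        basis = ["tuple", source_system, source_uri, conversation_id,
                 timestamp_ms, role, content_hash]
    # A JSON array gives an unambiguous encoding (no separator collisions).
    canonical = json.dumps(basis, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Truncation to 16 hex characters is an assumption; a longer ID lowers collision risk.
    return "evt_" + digest[:16]
```

Nothing in the input depends on the run time or on file ordering, so re-running over unchanged source data yields the same IDs.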
Timestamp normalization rules
- timestamp_ms must be epoch milliseconds as an integer
- If upstream provides seconds or ISO timestamps, convert to milliseconds
- Allowed range:
  - Must be within a plausible operational window
  - Must not be negative
- If the timestamp cannot be recovered:
  - The event must be rejected
  - The run record must include an error and the source pointer
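The normalization rules can be sketched as below. The caller states the upstream unit explicitly rather than having it guessed. Treating naive ISO timestamps as UTC and using 2100-01-01 as the upper bound are assumptions for illustration; the contract leaves the "plausible operational window" undefined. normalize_timestamp_ms and TimestampError are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_MS = 4_102_444_800_000  # 2100-01-01T00:00:00Z, an assumed upper bound


class TimestampError(ValueError):
    """Raised when a timestamp cannot be recovered or is out of range."""


def normalize_timestamp_ms(raw, unit: str) -> int:
    """Convert raw to epoch milliseconds; unit is 'ms', 's', or 'iso'."""
    if unit not in ("ms", "s", "iso"):
        raise TimestampError(f"unknown unit: {unit!r}")
    try:
        if unit == "ms":
            ms = int(raw)
        elif unit == "s":
            ms = round(float(raw) * 1000)
        else:
            dt = datetime.fromisoformat(str(raw).replace("Z", "+00:00"))
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
            # Integer division of timedeltas avoids float rounding.
            ms = (dt - EPOCH) // timedelta(milliseconds=1)
    except (TypeError, ValueError, OverflowError) as exc:
        raise TimestampError(f"unrecoverable timestamp: {raw!r}") from exc
    if ms < 0 or ms > MAX_MS:
        raise TimestampError(f"timestamp out of range: {ms}")
    return ms
```

A caller would catch TimestampError, reject the event, and record the error with the source pointer.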
Provenance rules
- source must allow replay
- source_uri must point to a retained raw input or a canonical ingest snapshot
- If the same raw file is re-ingested, the resulting events must have the same event_id values
Invariants
These rules are mandatory. Violations must trigger stop-the-line behavior.
- Daily files are append-only
- Stable IDs must not change across re-runs
- Manifest exists for every day, even for empty day files
- schema_version exists on every event
- Each daily file must be valid JSONL
- event_id must be unique within a day file
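The append-only invariant can be checked mechanically if the size and SHA-256 of the previous version of a day file were recorded, for example in its manifest integrity fields. The sketch below is one possible check, not a mandated mechanism; is_append_only and its parameters are hypothetical.

```python
import hashlib
from pathlib import Path


def is_append_only(daily_file: Path, prev_sha256: str, prev_bytes: int) -> bool:
    """True if the first prev_bytes of the file still hash to prev_sha256."""
    with daily_file.open("rb") as fh:
        prefix = fh.read(prev_bytes)
    if len(prefix) < prev_bytes:
        return False  # the file shrank, so earlier content was removed
    return hashlib.sha256(prefix).hexdigest() == prev_sha256
```

If the old content is an unchanged prefix of the new file, only appends have happened since the previous record was taken.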
Manifest contract
Minimum required fields for YYYY-MM-DD.manifest.json:
- schema_version: String. Example: event_manifest.v1
- bus_schema_version: String. Example: event.v1
- day: String. YYYY-MM-DD
- daily_path: String. Path to the daily JSONL file
- counts: Object with:
  - events_total: integer
  - events_by_role: object mapping role to integer
- integrity: Object with:
  - sha256: string for the daily file content
  - bytes: integer for daily file size
Optional fields:
- producer: object with repo, version, git commit
- generated_at: timestamp
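A manifest with the minimum required fields could be built from a day file roughly as follows. This is a sketch under stated assumptions: build_manifest is a hypothetical helper, the bus schema version is passed in rather than inferred, and the hash is taken over the raw file bytes.

```python
import hashlib
import json
from collections import Counter
from pathlib import Path


def build_manifest(daily_file: Path, day: str, bus_schema_version: str = "event.v1") -> dict:
    """Build the minimum manifest for one day file, which may be empty."""
    data = daily_file.read_bytes()
    by_role = Counter()
    # Split on "\n" only; blank lines (such as after the final newline) are skipped.
    for line in data.split(b"\n"):
        if line.strip():
            by_role[json.loads(line)["role"]] += 1
    return {
        "schema_version": "event_manifest.v1",
        "bus_schema_version": bus_schema_version,
        "day": day,
        "daily_path": f"eventbus/daily/{day}.jsonl",
        "counts": {
            "events_total": sum(by_role.values()),
            "events_by_role": dict(by_role),
        },
        "integrity": {
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        },
    }
```

For an empty day file this yields events_total of 0, an empty events_by_role object, and the SHA-256 of zero bytes, so the manifest still exists and still verifies.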
Compatibility rules
Schema evolution must be controlled and predictable.
- Consumers must ignore unknown fields
- Producers may add optional fields without a version bump if:
  - Existing required fields are unchanged
  - Semantics of existing fields are unchanged
- Any change to required fields, meaning, or ID rules requires:
  - ADR
  - Schema version bump
  - Minimal migration note
Backward compatibility window:
- Consumers must support at least the latest and previous schema versions, unless an ADR states otherwise.
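On the consumer side, these rules amount to a tolerant reader: check the version, keep the fields you know, and drop the rest. The sketch below is illustrative only. SUPPORTED_VERSIONS, KNOWN_FIELDS, and read_event are hypothetical names, and only event.v1 is listed because no other version appears in this contract.

```python
# A real consumer would list the latest and the previous schema version here.
SUPPORTED_VERSIONS = {"event.v1"}
KNOWN_FIELDS = ("schema_version", "event_id", "timestamp_ms", "role",
                "content", "source", "conversation_id", "title")


def read_event(raw: dict) -> dict:
    """Accept a supported event and silently drop fields this consumer does not know."""
    version = raw.get("schema_version")
    if not isinstance(version, str) or version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return {key: raw[key] for key in KNOWN_FIELDS if key in raw}
```

Because unknown keys are filtered instead of rejected, a producer can add an optional field without breaking this consumer.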
Smoke test
Purpose: prove that a minimal run can produce contract-compliant outputs.
Minimal command:
- A repo-specific smoke command is acceptable, but it must validate:
  - At least one day file path is produced, even if empty
  - A manifest is produced for that day
  - JSONL parses
  - No duplicate event_id values in the file
  - Manifest counts match parsed counts
Expected validations:
- JSONL parse for the daily file
- Schema presence checks for required fields
- Manifest presence and checksum match
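A smoke check covering most of these validations might look like the sketch below. It returns error codes instead of raising so a caller can decide how to report them. Schema presence checks are left out for brevity, a missing event_id is compared as the string "None", and reporting a missing manifest as MANIFEST_MISMATCH is an assumption, since the contract names no separate code for it. smoke_check is a hypothetical name, and the manifest is assumed to be a well-formed JSON object.

```python
import hashlib
import json
from pathlib import Path


def smoke_check(daily_file: Path, manifest_file: Path) -> list[str]:
    """Return contract error codes found; an empty list means the check passes."""
    if not daily_file.exists():
        return ["MISSING_DAILY_FILE"]
    data = daily_file.read_bytes()
    errors = []
    seen = set()
    total = 0
    for line in data.split(b"\n"):
        if not line.strip():
            continue  # tolerate the blank entry after the final newline
        try:
            event = json.loads(line)
        except ValueError:  # covers invalid JSON and invalid UTF-8
            return ["MALFORMED_JSONL"]
        if not isinstance(event, dict):
            return ["MALFORMED_JSONL"]  # assumption: a non-object line is malformed
        total += 1
        event_id = str(event.get("event_id"))
        if event_id in seen and "DUPLICATE_EVENT_ID" not in errors:
            errors.append("DUPLICATE_EVENT_ID")
        seen.add(event_id)
    if not manifest_file.exists():
        return errors + ["MANIFEST_MISMATCH"]
    manifest = json.loads(manifest_file.read_text(encoding="utf-8"))
    count_ok = manifest.get("counts", {}).get("events_total") == total
    hash_ok = manifest.get("integrity", {}).get("sha256") == hashlib.sha256(data).hexdigest()
    if not (count_ok and hash_ok):
        errors.append("MANIFEST_MISMATCH")
    return errors
```

A repo-specific smoke command could call this for one day and exit non-zero when the returned list is not empty.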
Failure modes and required behavior
Stop-the-line means:
- Fail fast
- Write a run record with error taxonomy
- Do not silently degrade
- Do not emit partial outputs without manifest integrity
Missing day file
Symptoms:
- eventbus/daily/YYYY-MM-DD.jsonl is missing
Required response:
- Fail the run
- Record error: MISSING_DAILY_FILE
Malformed JSONL
Symptoms:
- A line is not valid JSON
- Encoding errors
Required response:
- Fail the run
- Record error: MALFORMED_JSONL
- Include source pointer to offending line number if possible
Duplicate event IDs
Symptoms:
- Two lines share the same event_id
Required response:
- Fail the run
- Record error: DUPLICATE_EVENT_ID
Timestamp out of range
Symptoms:
- Negative timestamp or implausible date
Required response:
- Reject event
- Fail the run if any rejected events occur
- Record error: TIMESTAMP_OUT_OF_RANGE
Manifest mismatch
Symptoms:
- Parsed count differs from manifest count
- SHA256 differs
Required response:
- Fail the run
- Record error: MANIFEST_MISMATCH
- Recommend rebuild of that day output as remediation
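The error codes above can be collected into a small taxonomy, with a helper that writes the run record before failing. The contract does not define the run record format, so the record layout, the BusError and StopTheLine names, and the fail_run signature are all hypothetical.

```python
import json
from enum import Enum
from pathlib import Path
from typing import NoReturn, Optional


class BusError(str, Enum):
    MISSING_DAILY_FILE = "MISSING_DAILY_FILE"
    MALFORMED_JSONL = "MALFORMED_JSONL"
    DUPLICATE_EVENT_ID = "DUPLICATE_EVENT_ID"
    TIMESTAMP_OUT_OF_RANGE = "TIMESTAMP_OUT_OF_RANGE"
    MANIFEST_MISMATCH = "MANIFEST_MISMATCH"


class StopTheLine(RuntimeError):
    """Raised after the run record is written so the run fails fast."""


def fail_run(run_record: Path, code: BusError, detail: str,
             source_pointer: Optional[str] = None) -> NoReturn:
    """Write a run record (assumed layout) and then abort the run."""
    record = {
        "status": "failed",
        "error": code.value,
        "detail": detail,
        "source_pointer": source_pointer,
    }
    run_record.parent.mkdir(parents=True, exist_ok=True)
    run_record.write_text(json.dumps(record, indent=2), encoding="utf-8")
    raise StopTheLine(f"{code.value}: {detail}")
```

Writing the record before raising means the failure is recorded even if nothing upstream catches the exception.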