Sessions Bus contract
Contract for session objects and cluster outputs.
Purpose
The Sessions Bus is the canonical stream of session objects derived from Event Bus events, plus the cluster outputs produced from those sessions.
This page defines the session object schema, the cluster output schema, on-disk layout, manifests, invariants, compatibility rules, smoke test expectations, and failure modes.
What this is
- Sessions are time window groupings of events.
- Every session contains explicit pointers back to the source
event_idvalues. - Clusters are assignments over sessions (and optionally over events through session linkage).
What this is not
- This is not a place to re-parse raw ChatGPT exports or re-compute event ingestion.
- This is not a summarization or enrichment service.
- This is not a vector store contract.
- This is not a UI export format.
Bus endpoints
Sessions daily files
Path pattern:
sessions/daily/YYYY-MM-DD.sessions.jsonl
Rules:
- One JSON object per line
- UTF-8
- Deterministic ordering is recommended but not required
- Must exist even if empty
Sessions daily manifest files
Path pattern:
sessions/manifest/YYYY-MM-DD.sessions.manifest.json
Must exist even if the sessions file is empty.
Clusters daily files
Path pattern:
clusters/daily/YYYY-MM-DD.clusters.jsonlorclusters/daily/YYYY-MM-DD.clusters.parquet
Pick one as canonical and keep it stable. If both exist, the manifest must declare which is canonical.
Rules:
- Must exist even if empty
Clusters daily manifest files
Path pattern:
clusters/manifest/YYYY-MM-DD.clusters.manifest.json
Must exist even if the clusters file is empty.
Canonical session schema
Required fields
-
schema_version
String. Example:session.v1 -
session_id
String. Stable, deterministic for a given input set and window semantics. -
day
String.YYYY-MM-DD. The day partition for this output file. -
window
Object describing how the session window was formed.
Required keys inside window:
-
window_type
String. Example:gap_basedorfixed -
start_ts_ms
Integer. -
end_ts_ms
Integer. -
timezone
String. IANA timezone name. Example:America/New_York -
event_ids
Array of strings. Each must be anevent_idfrom the Event Bus. -
event_count
Integer. Must equallen(event_ids). -
source
Object. Provenance of the sessionization process.
Minimum required keys inside source:
-
input_manifest_day
String.YYYY-MM-DDfor the Event Bus manifest consumed. -
input_manifest_sha256
String. Integrity anchor for the consumed Event Bus daily file. -
sessionizer_version
String. Producer version marker or git commit.
Optional fields
Optional fields may appear and consumers must ignore unknown fields.
Common examples:
conversation_idsarraytitle_candidatesarraylabelsarraysuc_idstringworkspacestringnotesstringmetricsobject such as token counts, duration, etc.
Evidence pointer fields
Sessions must be traceable back to concrete evidence. The minimal evidence is event_ids, but evidence pointers allow stable cross-artifact linking.
Recommended optional evidence fields:
evidenceobject with:eventbus_daily_pathstringeventbus_manifest_pathstringeventbus_sliceobject describing line offsets if you maintain them
Session id derivation
Goal: session_id must be stable when inputs and window semantics are unchanged.
Requirements:
- Deterministic: derived from the ordered set of
event_idvalues plus window semantics. - Include the
dayandwindow_typein the derivation inputs. - Include
start_ts_msandend_ts_msin the derivation inputs. - Use a cryptographic hash of a canonical serialization of these parts.
Prohibited:
- Random UUID generation
- IDs that depend on run time or file iteration order
Window semantics
Window semantics must be explicit so downstream consumers never guess.
The window.window_type must describe how sessions are created, at least at a coarse level.
Examples:
gap_basedwith a fixed inactivity thresholdfixedwith a defined durationconversation_basedif grouping by conversation id
If the algorithm changes in a way that alters session membership, you must:
- Version bump the session schema or the sessionizer version
- Write an ADR
- Include a minimal migration note describing the impact
Canonical cluster schema
Clusters may be produced per day, but cluster identity must be stable within a defined scope.
Required fields
-
schema_version
String. Example:cluster.v1 -
cluster_id
String. Stable identifier within the declared clustering scope. -
scope
Object describing what this clustering run covers.
Required keys inside scope:
-
scope_type
String. Example:dailyorrolling_window -
day
String.YYYY-MM-DDif scope_type is daily -
input_sessions_manifest_sha256
String. Integrity anchor for the consumed sessions output -
members
Array of objects each containing:session_idstringweightnumber or integer, optional but allowed
-
member_count
Integer. Must equallen(members). -
algorithm
Object describing clustering algorithm identity.
Minimum required keys inside algorithm:
namestringversionstringparams_hashstring
Optional fields
labelstringkeywordsarraycentroid_refobjectmetricsobjectstatusstring
Stable cluster id behavior
Clusters are more fragile than sessions because reclustering can reassign memberships.
Rules:
- Cluster ids must be stable within the same inputs and algorithm configuration.
- If the algorithm or parameters change, cluster ids may change and must be treated as a new clustering version.
- The cluster record must carry
algorithm.params_hashso downstream can detect drift.
Recommended derivation:
cluster_id = hash(scope + sorted(session_ids) + algorithm identity)
Endpoints and manifests
Sessions manifest contract
Minimum required fields for YYYY-MM-DD.sessions.manifest.json:
schema_versionstring, example:sessions_manifest.v1bus_schema_versionstring, example:session.v1daystringsessions_pathstringcountsobject:sessions_totalintegerevents_total_referencedinteger
integrityobject:sha256stringbytesinteger
producerobject, recommended:- repo, version, git commit
Clusters manifest contract
Minimum required fields for YYYY-MM-DD.clusters.manifest.json:
schema_versionstring, example:clusters_manifest.v1bus_schema_versionstring, example:cluster.v1daystringclusters_pathstringcountsobject:clusters_totalintegersessions_total_referencedinteger
integrityobject:sha256string for JSONL, or a declared checksum strategy for parquetbytesinteger
producerobject, recommended
Invariants
These rules are mandatory. Violations must trigger stop-the-line behavior.
- Every
session.event_idsentry must reference an existing Event Busevent_idfor the consumed input day set. session_idmust be stable when the same event ids and window semantics are used.- Clusters must reference only existing
session_idvalues from the sessions outputs in scope. - Empty outputs are valid:
- empty sessions file exists
- empty clusters file exists
- both manifests exist and reflect zero counts
- Schema version must be present on every record.
Compatibility rules
- Consumers must ignore unknown optional fields.
- Adding optional fields does not require a schema bump if existing semantics do not change.
- Changes to required fields, window semantics meaning, or id derivation require:
- ADR
- Schema version bump
- Minimal migration note
Backward compatibility window:
- Consumers must support at least the latest and previous schema versions, unless an ADR states otherwise.
Smoke test
Purpose: prove minimal compliance with this contract, including linkage to upstream Event Bus outputs.
Minimal command:
- A repo-specific smoke command is acceptable, but it must validate:
- It can read at least one Event Bus day plus manifest
- It emits sessions daily file and sessions manifest
- It emits clusters daily file and clusters manifest
- JSONL parsing for sessions and clusters if JSONL is used
- Session event ids exist in the input event set
- Session ids are unique within the day file
- Cluster members reference existing session ids
Acceptance criteria:
- All expected output files exist
- All manifests exist
- Manifest counts match parsed counts
- No linkage violations
Error taxonomy mapping
All failures must be recorded in run records with a clear error code. These are recommended codes for the Sessions Bus.
Input missing
MISSING_EVENTBUS_DAILY_FILEMISSING_EVENTBUS_MANIFESTMISSING_EVENTBUS_RANGE(when required day range is unavailable)
Input invalid
EVENTBUS_SCHEMA_MISMATCHEVENTBUS_MALFORMED_JSONLEVENTBUS_MANIFEST_MISMATCH
Session output violations
SESSIONS_MALFORMED_JSONLSESSIONS_SCHEMA_MISMATCHSESSIONS_DUPLICATE_SESSION_IDSESSIONS_REFERENCE_UNKNOWN_EVENT_IDSESSIONS_MANIFEST_MISMATCH
Cluster output violations
CLUSTERS_SCHEMA_MISMATCHCLUSTERS_REFERENCE_UNKNOWN_SESSION_IDCLUSTERS_MANIFEST_MISMATCH
Upstream incomplete
Use when upstream is present but semantically incomplete, such as missing required event fields that block sessionization.
UPSTREAM_INCOMPLETE_REQUIRED_FIELDS
Failure modes and required behavior
Stop-the-line means:
- Fail fast
- Write a run record with error taxonomy
- Do not silently degrade
- Do not emit partial outputs without manifest integrity
Common failure modes:
- Missing event day file or manifest
- Event schema mismatch
- Sessions reference unknown event ids
- Cluster references unknown session ids
- Manifest mismatches on outputs