Worked example — one incident, nine agents, one report

A representative AI safety incident arrives at 09:00 Tuesday morning. By Wednesday 06:00 the AI Guardrail Lab has read it, analyzed it through five lenses, designed controls, captured evidence, and produced a ready-to-circulate report.

This page walks the trace end-to-end. It is the single best demonstration of what the 9-agent Guardrail Lab actually does.

The incident

Tuesday 09:00. Incident ID: INC-2026-0142. Source: OECD AI Incidents Monitor (AIM).

A pattern that recurs in real incident databases:

An AI-powered customer-support chatbot deployed by a mid-size airline confidently quoted a bereavement refund policy that did not exist. A grieving customer purchased an expensive last-minute ticket relying on the chatbot’s promise. The carrier later denied the refund. A small-claims court ruled the chatbot’s statement was binding on the carrier — refund must be honored. Public-facing reputational damage followed.

System type: LLM customer-support chatbot. Deployment context: post-purchase customer service. Harm type: misinformation · financial · reputational. Severity: 4/5.

This is a real-world pattern, represented here as a single JSONL row in the dataset configured for the Guardrail Lab.

The trace

T+0 (09:00:00) — Incident lands

The Incident Collector watches a configured incidents feed (OECD AIM + AIAAIC + Stanford + Damien Charlotin tracker). A new row matching the configured filter (severity ≥ 3, public chatbot, last 72h) arrives.

~/incidents/raw/2026-05-19_INC-2026-0142.jsonl

Nothing else happens yet. The office is silent.

T+5 seconds (09:00:05) — Stage 1: Incident Collector fires

Agent #1 of the AI Guardrail Lab. Watching the feed, it picks up the new row.

Reads the JSONL record
Validates schema (required fields present)
Writes a row to the incidents table: id=INC-2026-0142, severity=4, system_type="LLM chatbot", deployment_context="customer support", harm=["misinformation","financial","reputational"], received_at=2026-05-19T09:00:00
Emits an event: incidents:INC-2026-0142 new

Cost: Haiku, ~$0.0001. Time: ~3s.
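A minimal sketch of the Collector's validate, filter and insert step, assuming the raw row carries the fields shown in the incidents table above plus a hypothetical `public_facing` flag. The function names, the SQLite table and the event-string format are illustrative, not the Lab's actual code.

```python
import json
import sqlite3
from datetime import datetime, timedelta
from typing import Optional

# Assumed required fields; the real feed schema may differ.
REQUIRED = ("id", "severity", "system_type", "deployment_context", "harm", "received_at")

def passes_filter(rec: dict, now: datetime) -> bool:
    """The trace's configured filter: severity >= 3, public chatbot, last 72h."""
    received = datetime.fromisoformat(rec["received_at"])
    return (
        rec["severity"] >= 3
        and "chatbot" in rec["system_type"].lower()
        and rec.get("public_facing", True)  # hypothetical flag; defaults to public
        and now - received <= timedelta(hours=72)
    )

def collect(line: str, db: sqlite3.Connection, now: datetime) -> Optional[str]:
    """Validate one JSONL row, store it, and return the event to emit (or None)."""
    rec = json.loads(line)
    missing = [f for f in REQUIRED if f not in rec]
    if missing:
        raise ValueError(f"schema violation, missing fields: {missing}")
    if not passes_filter(rec, now):
        return None
    db.execute(
        "INSERT INTO incidents VALUES (?, ?, ?, ?, ?, ?)",
        (rec["id"], rec["severity"], rec["system_type"], rec["deployment_context"],
         json.dumps(rec["harm"]), rec["received_at"]),
    )
    db.commit()
    return f"incidents:{rec['id']} new"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE incidents (id TEXT PRIMARY KEY, severity INTEGER, system_type TEXT,"
           " deployment_context TEXT, harm TEXT, received_at TEXT)")
row = json.dumps({"id": "INC-2026-0142", "severity": 4, "system_type": "LLM chatbot",
                  "deployment_context": "customer support",
                  "harm": ["misinformation", "financial", "reputational"],
                  "received_at": "2026-05-19T09:00:00"})
print(collect(row, db, now=datetime(2026, 5, 19, 9, 0, 5)))  # incidents:INC-2026-0142 new
```

Raising on a schema violation, rather than silently skipping, keeps malformed rows visible to whoever operates the feed.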

T+15 seconds (09:00:15) — Stage 2: Converter

The Converter normalises raw JSONL into an agent-readable .md sidecar.

~/library/readable/incidents/2026-05-19_INC-2026-0142.md

With frontmatter:

source_url: file:///incidents/raw/2026-05-19_INC-2026-0142.jsonl
source_row: incidents:INC-2026-0142
system_type: LLM chatbot
context: customer support
severity: 4
harm_types: [misinformation, financial, reputational]

The body becomes plain markdown describing the incident in 200-400 words — facts only, no analysis.

Cost: Haiku, ~$0.0005. Time: ~10s.
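A sketch of the sidecar assembly step. It assumes the frontmatter is delimited with `---` lines (the usual markdown-frontmatter convention) and that the factual body text comes from the Converter's model call; `to_sidecar` is a hypothetical name. `Path.as_uri()` requires an absolute path, hence the `resolve()`.

```python
from pathlib import Path

def to_sidecar(raw_path: Path, rec: dict, body: str) -> str:
    """Render one incident record as markdown with frontmatter.

    `body` is the 200-400 word factual summary, assumed to be produced by
    the converter's model call; this function only assembles the file.
    """
    front = "\n".join([
        "---",
        f"source_url: {raw_path.resolve().as_uri()}",
        f"source_row: incidents:{rec['id']}",
        f"system_type: {rec['system_type']}",
        f"context: {rec['deployment_context']}",
        f"severity: {rec['severity']}",
        f"harm_types: [{', '.join(rec['harm'])}]",
        "---",
    ])
    return f"{front}\n\n{body.strip()}\n"

rec = {"id": "INC-2026-0142", "system_type": "LLM chatbot",
       "deployment_context": "customer support", "severity": 4,
       "harm": ["misinformation", "financial", "reputational"]}
md = to_sidecar(Path("incidents/raw/2026-05-19_INC-2026-0142.jsonl"), rec, "Facts only...")
print(md.splitlines()[2])  # source_row: incidents:INC-2026-0142
```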

T+30 seconds → T+5 minutes (09:00:30 – 09:05:00) — Stage 3: five analyzers fan out

This is the most important stage. Five of the Guardrail Lab's nine agents read the same .md through different lenses. The four primary analyzers run in parallel; the Critic (#9) also reads their post-its, so it completes last.

Each writes a post-it — a structured note — to the agent_postits table. Same source_url, same casefile_anchor (chatbot-hallucinated-policy), different perspective.

Agent #2 · Root Cause — technical lens

perspective: root-cause
casefile: chatbot-hallucinated-policy
importance: 5 (critical)
body:
  Technical root cause: LLM hallucination on under-trained policy domain.
  System prompt did not include the actual refund policy text. Retrieval layer
  was absent — chatbot answered from training-data priors rather than
  authoritative sources. No grounding citations enforced. No confidence
  threshold applied. Output post-processing did not detect the policy claim
  for verification.
  Contributing factors: deployment shipped without policy-domain test suite;
  no canary on hallucination rate; no human-in-the-loop on refund claims.
source_row_ids: [incidents:INC-2026-0142]

Agent #3 · Threat Modeling — adversarial lens

perspective: threat-model
casefile: chatbot-hallucinated-policy
importance: 4 (high)
body:
  Attack/abuse vectors enabled by this failure pattern:
  - Prompt-injection: user could engineer the chatbot to promise more
    favourable terms (jailbreak via emotional framing observed in this case
    even without malicious intent).
  - Adversarial extraction: attacker discovers chatbot accepts policy
    fabrications → uses transcripts as leverage in disputes.
  - Reputational denial-of-service: bad-faith actors flood support with
    edge cases hoping chatbot fabricates costly promises.
  Damage vectors: direct financial loss per honored fabrication +
  reputational cascade once one ruling becomes public.
source_row_ids: [incidents:INC-2026-0142]

Agent #4 · Guardrail Designer — prevention lens

perspective: guardrail
casefile: chatbot-hallucinated-policy
importance: 5 (critical)
body:
  Preventive controls:
  1. Retrieval-grounded responses — every policy claim must cite an
     authoritative document; no claim without citation is allowed to ship.
  2. Domain-specific output filter — regex + classifier to detect
     refund/policy statements; route to human review.
  3. Confidence threshold — refuse to commit to policy details below 0.85
     model confidence on a calibrated test set.
  Detection controls:
  4. Hallucination canary — daily sample of 100 chatbot responses scored by
     ground-truth checker; alert if rate >2%.
  5. Customer-facing disclaimer — every policy statement carries "verify with
     a human agent for binding details" suffix.
  Blocking controls:
  6. Hard block on refund-amount commitments above $X without human approval.
source_row_ids: [incidents:INC-2026-0142]
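Controls 1, 2, 3 and 6 can be sketched as a single routing function applied to each draft reply before it is sent. This assumes the retrieval layer returns citation IDs and the model exposes a calibrated confidence score; the keyword regex is a stand-in for the regex + classifier pair, and `amount_cap` is the unspecified $X threshold, supplied by the caller. All names are hypothetical.

```python
import re
from dataclasses import dataclass

# Stand-in for the regex + classifier pair: keywords that mark a policy claim.
POLICY_CLAIM = re.compile(
    r"\b(refund\w*|reimburs\w*|bereavement|polic(?:y|ies)|eligib\w*|entitled)\b", re.I)
AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{2})?)")

@dataclass
class Draft:
    text: str
    citations: list[str]   # IDs of authoritative policy documents retrieved
    confidence: float      # assumed calibrated score in [0, 1]

def route(draft: Draft, amount_cap: float, min_conf: float = 0.85) -> str:
    """Return 'ship', 'human_review', or 'block' for a chatbot draft reply."""
    if not POLICY_CLAIM.search(draft.text):
        return "ship"                      # not a policy statement
    amounts = [float(a.replace(",", "")) for a in AMOUNT.findall(draft.text)]
    if any(a > amount_cap for a in amounts):
        return "block"                     # control 6: hard block above $X
    if not draft.citations or draft.confidence < min_conf:
        return "human_review"              # controls 1-3: uncited or low confidence
    return "ship"

reply = Draft("You can apply for a bereavement refund within 90 days.", [], 0.93)
print(route(reply, amount_cap=500.0))      # human_review (no citation)
```

Ordering the block check before the citation check means a large commitment is stopped even when it is well cited, matching the "without human approval" wording of control 6.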

Agent #5 · Dev Process — SDLC lens

perspective: dev-process
casefile: chatbot-hallucinated-policy
importance: 3 (medium)
body:
  Where guardrails insert into the SDLC:
  - Pre-deploy: policy-domain test suite (1000+ Q&A pairs from authoritative
    docs) must pass at 95% accuracy.
  - CI gate: hallucination rate on canary set must stay <2% rolling.
  - Pre-release: red-team simulation including prompt-injection attempts on
    refund/policy questions.
  - Post-deploy: weekly review of customer transcripts flagged by detection
    filter.
  Required role additions: policy SME embedded in pre-deploy reviews;
  on-call rotation for hallucination-canary alerts.
source_row_ids: [incidents:INC-2026-0142]
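The pre-deploy and CI gates reduce to a few threshold checks. A minimal sketch, assuming per-question pass/fail results from the policy-domain suite and per-response flags from the canary checker are already computed; maintaining the rolling window is left to the caller, and `ci_gate` is a hypothetical name.

```python
def ci_gate(test_results: list, canary_flags: list,
            min_accuracy: float = 0.95, max_rate: float = 0.02,
            min_suite_size: int = 1000) -> tuple:
    """Return (passed, reason) for the pre-deploy suite and the canary gate.

    test_results: one bool per policy-domain Q&A pair (True = correct answer).
    canary_flags: one bool per sampled response in the current rolling window
                  (True = the ground-truth checker flagged a hallucination).
    """
    if len(test_results) < min_suite_size:
        return False, f"suite has {len(test_results)} pairs, need {min_suite_size}+"
    accuracy = sum(test_results) / len(test_results)
    if accuracy < min_accuracy:
        return False, f"accuracy {accuracy:.1%} below {min_accuracy:.0%}"
    if not canary_flags:
        return False, "no canary sample in window"
    rate = sum(canary_flags) / len(canary_flags)
    if rate >= max_rate:
        return False, f"hallucination rate {rate:.1%} not below {max_rate:.0%}"
    return True, "ok"

suite = [True] * 960 + [False] * 40   # 96% correct on 1000 pairs
canary = [False] * 99 + [True]        # 1 flagged out of 100 sampled
print(ci_gate(suite, canary))         # (True, 'ok')
```

An empty canary window fails closed, so a broken sampler cannot silently pass the gate.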

Agent #9 · Critic — gap-finding lens

perspective: critic
casefile: chatbot-hallucinated-policy
importance: 4 (high)
body:
  What the prior four analyses missed:
  - Legal layer: chatbot output as binding contract is a jurisdictional risk
    needing legal review, not just technical guardrails.
  - Contractual layer: customer ToS may need updating to clarify chatbot
    statements are advisory, not binding. (Legal must decide whether such
    clauses survive scrutiny; in this incident the court held the carrier
    bound by its chatbot's statement.)
  - Organizational layer: who owns the chatbot? If product owns it but
    support handles consequences, there's a misaligned-incentive gap.
  - Audit layer: how do we evidence "we did our diligence" after an incident?
    The Guardrail Designer's controls need an audit trail showing they ran.
  Recommendation: cross-functional governance loop, not just engineering controls.
source_row_ids: [incidents:INC-2026-0142]

Five rows in agent_postits. Same source. Five readings. Complementary, not contradictory — each lens surfaces what the others miss.

Cost: Sonnet × 4 + Opus × 1 (Critic), ~$0.12 total. Time: ~4 minutes (parallel).
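The fan-out can be sketched with `asyncio`. One assumption is worth stating: the Critic's note opens with "What the prior four analyses missed", so it needs the other four post-its as input. This sketch therefore runs the four lenses concurrently and the Critic afterwards over their output. The analyzer functions are hypothetical stubs standing in for model calls.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class PostIt:
    perspective: str
    casefile: str
    importance: int
    body: str
    source_row_ids: list

CASEFILE = "chatbot-hallucinated-policy"

async def analyze(lens: str, doc: str, row_id: str) -> PostIt:
    # Stand-in for a model call (Sonnet in the trace) reading `doc` through `lens`.
    await asyncio.sleep(0)
    return PostIt(lens, CASEFILE, 3, f"{lens} reading of the incident", [row_id])

async def critic(doc: str, row_id: str, others: list) -> PostIt:
    # Stand-in for the Opus call; it takes the other four post-its as input.
    seen = ", ".join(p.perspective for p in others)
    return PostIt("critic", CASEFILE, 4, f"gaps not covered by: {seen}", [row_id])

async def fan_out(doc: str, row_id: str) -> list:
    lenses = ["root-cause", "threat-model", "guardrail", "dev-process"]
    first = await asyncio.gather(*(analyze(lens, doc, row_id) for lens in lenses))
    return [*first, await critic(doc, row_id, list(first))]

postits = asyncio.run(fan_out("incident markdown body", "incidents:INC-2026-0142"))
print([p.perspective for p in postits])
# ['root-cause', 'threat-model', 'guardrail', 'dev-process', 'critic']
```

`asyncio.gather` returns results in the order the awaitables were passed, so post-it order is stable even though the four calls overlap.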

T+10 minutes (09:10:00) — Stage 4: Evidence & Audit clusters

Agent #8 watches agent_postits for new rows clustering around a casefile. When five new post-its land within a 10-minute window, it fires.

Acquires a per-casefile lock
Reads all 5 new post-its + the existing story.md for chatbot-hallucinated-policy
Reads related historical incidents (other chatbot-hallucination patterns)
Rewrites the casefile story:

~/library/stories/chatbot-hallucinated-policy/story.md

Excerpt:

# Pattern: AI chatbot fabricates binding policy commitments

**Pattern type:** LLM hallucination · customer-facing · contract-binding outcome.
**Severity baseline:** 4/5 (direct financial + reputational).
**Recurrence:** 3rd related incident in 18 months (precedents: INC-2025-0089, INC-2025-0144).

## What we now know
- Without retrieval grounding, LLM customer-support agents hallucinate policies
  with non-trivial frequency.
- Courts have started treating chatbot outputs as binding on the deployer.
- Engineering-only controls are insufficient; legal + contractual + organizational
  controls also required.

## Required controls (from this incident's analyses)
- [Root Cause] Retrieval grounding + confidence threshold + post-process filter
- [Threat Model] Prompt-injection hardening + abuse-pattern monitoring
- [Guardrail] 6 layered controls (preventive · detection · blocking)
- [Dev Process] Pre-deploy test suite · CI canary · red-team · embed policy SME
- [Critic] Cross-functional governance loop including legal + product + support

## Citations
- post-its: [postit:88432, postit:88433, postit:88434, postit:88435, postit:88436]
- source: incidents:INC-2026-0142
- related: incidents:INC-2025-0089, incidents:INC-2025-0144

The previous version of the story is archived to ~/library/stories/chatbot-hallucinated-policy/archive/2026-05-19T09-09-50.md, so anyone auditing can see when the analysis changed.

Cost: Sonnet, ~$0.02. Time: ~2 minutes.
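A sketch of the Evidence & Audit trigger and the archive-then-rewrite step. It assumes post-it timestamps per casefile are available and uses an in-process `threading.Lock` per casefile; a multi-process deployment would need a database or file lock instead. Function names are hypothetical; the archive filename format follows the path shown above.

```python
import shutil
import tempfile
import threading
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional

WINDOW = timedelta(minutes=10)
THRESHOLD = 5
_locks = defaultdict(threading.Lock)  # one lock per casefile (single process only)

def should_fire(postit_times: list, now: datetime) -> bool:
    """True once THRESHOLD new post-its for a casefile fall inside the window."""
    return sum(1 for t in postit_times if now - t <= WINDOW) >= THRESHOLD

def rewrite_story(story_dir: Path, casefile: str, new_text: str,
                  now: datetime) -> Optional[Path]:
    """Archive the current story.md (if any), then write the new version."""
    with _locks[casefile]:
        story_dir.mkdir(parents=True, exist_ok=True)
        story = story_dir / "story.md"
        archived = None
        if story.exists():
            archived = story_dir / "archive" / f"{now:%Y-%m-%dT%H-%M-%S}.md"
            archived.parent.mkdir(exist_ok=True)
            shutil.copy2(story, archived)
        story.write_text(new_text, encoding="utf-8")
        return archived

t0 = datetime(2026, 5, 19, 9, 0, 30)
times = [t0 + timedelta(seconds=s) for s in (30, 60, 120, 200, 240)]
print(should_fire(times, datetime(2026, 5, 19, 9, 9, 50)))  # True

d = Path(tempfile.mkdtemp()) / "chatbot-hallucinated-policy"
rewrite_story(d, "chatbot-hallucinated-policy", "# previous version\n", datetime(2026, 5, 1))
print(rewrite_story(d, "chatbot-hallucinated-policy", "# new version\n",
                    datetime(2026, 5, 19, 9, 9, 50)).name)  # 2026-05-19T09-09-50.md
```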

T+next-day 06:00 (Wednesday 06:00:00) — Stage 5: Person-watcher writes the brief

A daily synthesis agent runs at 06:00. It reads:

Every active casefile story (this morning, including the new one)
Recent pinned facts
Open action items in user_tasks
Yesterday’s leave-notes from each Guardrail Lab agent

It writes one document — ~/library/final/daily-ai-safety-brief.md — capturing everything the office knows this morning.

Excerpt for this incident’s section:

## NEW · Chatbot hallucination as binding contract — 3rd recurrence in 18 months

Pattern recurred yesterday (INC-2026-0142, severity 4). Standard control set
required: retrieval grounding · confidence threshold · output filter · canary ·
human-on-refunds · prompt-injection hardening · cross-functional governance.

**Open action**: organizations deploying customer-support LLM should ship the
control bundle before next deployment review. Audit trail required.

**Related**: INC-2025-0089 (same pattern), INC-2025-0144 (same pattern). Trend.

**Critic flag**: recurrence rate suggests industry-wide gap, not single-org issue.
Recommend escalation to AI safety governance body.

This document is signed and hashed. The hash is stored in the DB so that, at cold start, the system can verify the file has not been tampered with.
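A minimal sketch of the tamper check, assuming SHA-256 over the file bytes and a hypothetical `doc_hashes` table. It covers hashing only; the signing step the trace mentions (for example an HMAC or an asymmetric signature over the hash) is not shown. A hash stored in the DB detects file tampering only as long as the DB itself is trusted.

```python
import hashlib
import hmac
import sqlite3
import tempfile
from pathlib import Path

def sha256_file(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_hash(db: sqlite3.Connection, path: Path) -> None:
    """Store the digest when the brief is written (the 06:00 run)."""
    db.execute("INSERT OR REPLACE INTO doc_hashes (path, sha256) VALUES (?, ?)",
               (str(path), sha256_file(path)))
    db.commit()

def verify(db: sqlite3.Connection, path: Path) -> bool:
    """At cold start: recompute the digest and compare it with the stored one."""
    row = db.execute("SELECT sha256 FROM doc_hashes WHERE path = ?", (str(path),)).fetchone()
    return row is not None and hmac.compare_digest(row[0], sha256_file(path))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE doc_hashes (path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)")
brief = Path(tempfile.mkdtemp()) / "daily-ai-safety-brief.md"
brief.write_text("## NEW · Chatbot hallucination as binding contract\n", encoding="utf-8")
record_hash(db, brief)
print(verify(db, brief))   # True
brief.write_text("## edited after hashing\n", encoding="utf-8")
print(verify(db, brief))   # False
```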

Cost: Opus, ~$0.30 (daily, amortised across all active casefiles). Time: ~5 minutes.

T+next-day 09:00 (Wednesday 09:00:00) — The engineer wakes up

An NBS engineer opens Claude Code at their workstation. The Guardrail Lab's user-facing agent, the single agent the engineer interacts with, wakes up, and the SessionStart hook fires.

The hook reads daily-ai-safety-brief.md and injects ~10K tokens of it into the session's context, so the agent already knows everything before the engineer says a word.
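A sketch of a script the SessionStart hook might run. It assumes Claude Code adds a SessionStart hook's standard output to the session context (check the hooks documentation for your version) and uses a rough 4-characters-per-token heuristic, not a real tokenizer, to stay near the ~10K-token budget. The brief path matches the trace; the function name and fallback message are illustrative.

```python
import sys
from pathlib import Path

BRIEF = Path.home() / "library" / "final" / "daily-ai-safety-brief.md"
CHARS_PER_TOKEN = 4       # rough heuristic, not a tokenizer
TOKEN_BUDGET = 10_000     # the trace's "~10K tokens"

def build_context(brief: Path, budget_tokens: int = TOKEN_BUDGET) -> str:
    """Return the text to inject at session start, truncated to the budget."""
    if not brief.exists():
        return "No daily AI safety brief found; ask the engineer for context."
    text = brief.read_text(encoding="utf-8")
    limit = budget_tokens * CHARS_PER_TOKEN
    if len(text) > limit:
        text = text[:limit] + "\n\n[brief truncated to fit the context budget]"
    return "# Daily AI safety brief (injected at SessionStart)\n\n" + text

if __name__ == "__main__":
    # Whatever is written to stdout is what the hook hands to the session.
    sys.stdout.write(build_context(BRIEF))
```

A production hook would likely run the hash check from Stage 5 before injecting anything, so a tampered brief never reaches the agent.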

The engineer types “summarise yesterday’s incidents and prep a control checklist.” The agent doesn’t ask “which incidents?” or “what’s the context?” It just produces:

A two-paragraph summary of INC-2026-0142
The recurrence pattern flag (3rd in 18 months)
A ready-to-circulate control checklist (Markdown, formatted, copy-paste)
A draft escalation note for AI safety governance

Every item is cited back through the chain to the raw incident bytes.

Why this is the demonstration that matters

Four things this trace demonstrates:

Claim	How this trace proves it
The office reads through perspectives	Five post-its, five lenses, one source. Root Cause sees mechanism; Threat Model sees abuse vectors; Guardrail Designer sees controls; Dev Process sees SDLC integration; Critic sees what the others miss. None contradicts.
Stories cluster, not pile	Evidence & Audit takes five post-its and writes one coherent narrative connecting this incident to prior recurrences. Not a dump — a synthesis.
Cold-start is free for the engineer	The agent wakes up knowing yesterday’s incident, the pattern, the controls. No “explain the context” prompt. No “what’s the situation” handshake.
Provenance is real	Every claim cites a post-it which cites a source row which cites raw JSONL bytes. Anyone asking “where did you get this?” gets a verifiable chain back to ground truth.
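The provenance chain can be sketched as a lookup walk from post-it to source row to raw JSONL line, with a stored hash proving the raw bytes are unchanged. The dictionaries stand in for the `agent_postits` and incidents tables, and the `raw_path`, `line_no` and `sha256` fields on a source row are assumptions about how a row points back to its bytes.

```python
import hashlib

def resolve(postit_id: str, postits: dict, rows: dict, raw_files: dict) -> list:
    """Walk post-it -> source row -> raw JSONL line, verifying each raw line's hash."""
    chain = []
    for row_id in postits[postit_id]["source_row_ids"]:
        row = rows[row_id]
        raw_line = raw_files[row["raw_path"]].splitlines()[row["line_no"]]
        chain.append({
            "postit": postit_id,
            "row": row_id,
            "raw": f'{row["raw_path"]}#L{row["line_no"] + 1}',
            "verified": hashlib.sha256(raw_line).hexdigest() == row["sha256"],
        })
    return chain

raw = b'{"id": "INC-2026-0142", "severity": 4}\n'
raw_files = {"2026-05-19_INC-2026-0142.jsonl": raw}
rows = {"incidents:INC-2026-0142": {
    "raw_path": "2026-05-19_INC-2026-0142.jsonl", "line_no": 0,
    "sha256": hashlib.sha256(raw.splitlines()[0]).hexdigest()}}
postits = {"postit:88432": {"source_row_ids": ["incidents:INC-2026-0142"]}}
chain = resolve("postit:88432", postits, rows, raw_files)
print(chain)
```

Answering "where did you get this?" then amounts to returning `chain`; any `verified: False` entry shows exactly which link has drifted from ground truth.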

The cost end-to-end

For one incident through the full pipeline:

Stage	Agent	Model	Cost	Time
1. Incident Collector	#1	Haiku	$0.0001	3s
2. Converter	(built-in)	Haiku	$0.0005	10s
3. 5 analyzers (parallel)	#2, #3, #4, #5, #9	Sonnet × 4 + Opus × 1	$0.12	4 min
4. Evidence & Audit (Story-builder role)	#8	Sonnet	$0.02	2 min
5. Daily synthesis	(Person-watcher role)	Opus	~$0.001 / incident	5 min daily
Total per incident			~$0.14	~6 min (mostly parallel)

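At 30 incidents / week (~120 / month), per-incident costs come to ~$17/month. The daily synthesis is a fixed ~$0.30/day whatever the volume, adding ~$9/month, so an entire AI safety incident-analysis office runs continuously for roughly $26/month.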
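The monthly figure follows from the cost table. A worked version, treating the daily synthesis as a fixed $0.30/day rather than a per-incident charge, and assuming four weeks and 30 days per month:

```python
# Per-incident stage costs (USD) from the cost table; the daily synthesis is
# counted separately because it runs once a day regardless of incident volume.
STAGE_COST = {"collector": 0.0001, "converter": 0.0005, "analyzers": 0.12, "story": 0.02}
DAILY_SYNTHESIS = 0.30

def monthly_cost(incidents_per_week: float, weeks: int = 4, days: int = 30) -> dict:
    per_incident = sum(STAGE_COST.values())
    variable = per_incident * incidents_per_week * weeks
    fixed = DAILY_SYNTHESIS * days
    return {"variable": round(variable, 2), "fixed": round(fixed, 2),
            "total": round(variable + fixed, 2)}

print(monthly_cost(30))  # {'variable': 16.87, 'fixed': 9.0, 'total': 25.87}
```

Separating the fixed term makes clear that at low volume the daily synthesis dominates, while at high volume the Stage 3 analyzers do.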

The 9 agents in this trace

#	Agent	Role in this trace
1	Incident Collector	Pulled INC-2026-0142 from the feed
2	Root Cause	Wrote the technical-lens post-it
3	Threat Modeling	Wrote the adversarial-lens post-it
4	Guardrail Designer	Wrote the controls post-it
5	Dev Process	Wrote the SDLC-integration post-it
6	Policy-as-Code	(not invoked in this trace — engages when controls land in code)
7	Claude Hook	(not invoked — engages when SDK-specific guardrails needed)
8	Evidence & Audit	Clustered the post-its into a story with provenance
9	Critic	Wrote the gap-finding post-it (what the others missed)

Agents #6 and #7 engage in the next phase — when the controls become code and CI checks. That extends the trace by ~30 minutes and produces deployable artifacts.