Worked example — one incident, nine agents, one report
A representative AI safety incident arrives at 09:00 Tuesday morning. By Wednesday 06:00 the AI Guardrail Lab has read it, analyzed it through five lenses, designed controls, captured evidence, and produced a ready-to-circulate report.
This page walks the trace end-to-end. It is the single best demonstration of what the 9-agent Guardrail Lab actually does.
The incident
Tuesday 09:00. Incident ID: INC-2026-0142. Source: OECD AI Incidents Monitor (AIID).
A pattern that recurs in real incident databases:
An AI-powered customer-support chatbot deployed by a mid-size airline confidently quoted a bereavement refund policy that did not exist. A grieving customer purchased an expensive last-minute ticket relying on the chatbot’s promise. The carrier later denied the refund. A small-claims court ruled the chatbot’s statement was binding on the carrier — refund must be honored. Public-facing reputational damage followed.
System type: LLM customer-support chatbot. Deployment context: post-purchase customer service. Harm type: misinformation · financial · reputational. Severity: 4/5.
A real-world pattern. The incident appears as one JSONL row in the dataset configured for the Guardrail Lab.
The trace
T+0 (09:00:00) — Incident lands
The Incident Collector watches a configured incidents feed (OECD AIID + AIAAIC + Stanford + Damien Charlotin tracker). A new row matching the configured filter (severity ≥ 3, public chatbot, last-72h) arrives.
~/incidents/raw/2026-05-19_INC-2026-0142.jsonl
Nothing else happens yet. The office is silent.
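The feed filter described above (severity ≥ 3, public chatbot, last 72 hours) could be sketched roughly as follows. The field names `public_facing` and `reported_at` are assumptions made for illustration; the text does not show the feed's actual row schema.

```python
from datetime import datetime, timedelta

# Filter values taken from the text; the row field names are hypothetical.
MIN_SEVERITY = 3
MAX_AGE = timedelta(hours=72)


def matches_filter(row: dict, now: datetime) -> bool:
    """Return True if a feed row should be handed to the Incident Collector."""
    reported_at = datetime.fromisoformat(row["reported_at"])
    return (
        row["severity"] >= MIN_SEVERITY
        and row["public_facing"]
        and "chatbot" in row["system_type"].lower()
        and now - reported_at <= MAX_AGE
    )


now = datetime(2026, 5, 19, 9, 0, 0)
row = {
    "id": "INC-2026-0142",
    "severity": 4,
    "system_type": "LLM chatbot",
    "public_facing": True,
    "reported_at": "2026-05-18T21:30:00",  # illustrative timestamp
}
print(matches_filter(row, now))
```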
T+5 seconds (09:00:05) — Stage 1: Incident Collector fires
Agent #1 of the AI Guardrail Lab. Watching the feed, it picks up the new row.
- Reads the JSONL record
- Validates schema (required fields present)
- Writes a row to the incidents table: id=INC-2026-0142, severity=4, system_type="LLM chatbot", deployment_context="customer support", harm=["misinformation","financial","reputational"], received_at=2026-05-19T09:00:00
- Emits an event: incidents:INC-2026-0142 new
Cost: Haiku, ~$0.0001. Time: ~3s.
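The three steps of Stage 1 (validate, write, emit) might look something like this minimal sketch. It uses an in-memory SQLite table as a stand-in; the table definition, the `collect` function, and the choice to store `harm` as JSON text are assumptions, not the Lab's actual implementation.

```python
import json
import sqlite3

# Field names follow the row shown in the text; the table DDL is an assumption.
REQUIRED_FIELDS = ("id", "severity", "system_type", "deployment_context", "harm")


def collect(line: str, received_at: str, db: sqlite3.Connection) -> str:
    """Validate one JSONL line, store it, and return the event to emit."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    db.execute(
        "INSERT INTO incidents VALUES (?, ?, ?, ?, ?, ?)",
        (
            record["id"],
            record["severity"],
            record["system_type"],
            record["deployment_context"],
            json.dumps(record["harm"]),  # store the list as JSON text
            received_at,
        ),
    )
    db.commit()
    return f"incidents:{record['id']} new"


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE incidents (id TEXT PRIMARY KEY, severity INTEGER, system_type TEXT,"
    " deployment_context TEXT, harm TEXT, received_at TEXT)"
)
line = json.dumps({
    "id": "INC-2026-0142",
    "severity": 4,
    "system_type": "LLM chatbot",
    "deployment_context": "customer support",
    "harm": ["misinformation", "financial", "reputational"],
})
event = collect(line, "2026-05-19T09:00:00", db)
print(event)
```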
T+15 seconds (09:00:15) — Stage 2: Converter
The Converter normalises raw JSONL into an agent-readable .md sidecar.
~/library/readable/incidents/2026-05-19_INC-2026-0142.md
With frontmatter:
source_url: file:///incidents/raw/2026-05-19_INC-2026-0142.jsonl
source_row: incidents:INC-2026-0142
system_type: LLM chatbot
context: customer support
severity: 4
harm_types: [misinformation, financial, reputational]
The body becomes plain markdown describing the incident in 200-400 words: facts only, no analysis.
Cost: Haiku, ~$0.0005. Time: ~10s.
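As an illustration of the Converter's output shape, the sketch below renders a sidecar from an incident record. The function name `render_sidecar` and the `---` frontmatter delimiters are assumptions; the frontmatter keys are the ones listed above.

```python
def render_sidecar(incident: dict, raw_filename: str, body: str) -> str:
    """Build the .md sidecar: frontmatter block first, then the factual body."""
    harm_types = ", ".join(incident["harm"])
    frontmatter = [
        "---",  # assumed delimiter; the text only says "frontmatter"
        f"source_url: file:///incidents/raw/{raw_filename}",
        f"source_row: incidents:{incident['id']}",
        f"system_type: {incident['system_type']}",
        f"context: {incident['deployment_context']}",
        f"severity: {incident['severity']}",
        f"harm_types: [{harm_types}]",
        "---",
    ]
    return "\n".join(frontmatter) + "\n\n" + body.strip() + "\n"


incident = {
    "id": "INC-2026-0142",
    "severity": 4,
    "system_type": "LLM chatbot",
    "deployment_context": "customer support",
    "harm": ["misinformation", "financial", "reputational"],
}
sidecar = render_sidecar(
    incident,
    "2026-05-19_INC-2026-0142.jsonl",
    "An airline chatbot quoted a bereavement refund policy that did not exist.",
)
print(sidecar)
```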
T+30 seconds → T+5 minutes (09:00:30 – 09:05:00) — Stage 3: five analyzers fan out
This is the most important stage. Five of the Guardrail Lab’s nine agents read the same .md through different lenses. They run in parallel.
Each writes a post-it — a structured note — to the agent_postits table. Same source_url, same casefile_anchor (chatbot-hallucinated-policy), different perspective.
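The parallel fan-out can be pictured with a small `asyncio` sketch. The `analyze` coroutine is a stand-in for a real analyzer agent (no model call is made here), and the post-it dict only mirrors the fields named in the text; how the Lab actually schedules its agents is not specified.

```python
import asyncio

PERSPECTIVES = ["root-cause", "threat-model", "guardrail", "dev-process", "critic"]


async def analyze(perspective: str, sidecar_text: str) -> dict:
    """Stand-in for one analyzer agent; a real one would call a model here."""
    await asyncio.sleep(0)  # placeholder for the model call
    return {
        "perspective": perspective,
        "casefile": "chatbot-hallucinated-policy",
        "body": f"[{perspective}] notes on {len(sidecar_text)} chars of source",
        "source_row_ids": ["incidents:INC-2026-0142"],
    }


async def fan_out(sidecar_text: str) -> list[dict]:
    # gather() runs the five coroutines concurrently and preserves input order.
    return await asyncio.gather(*(analyze(p, sidecar_text) for p in PERSPECTIVES))


postits = asyncio.run(
    fan_out("An airline chatbot quoted a refund policy that did not exist.")
)
for postit in postits:
    print(postit["perspective"], postit["source_row_ids"])
```

Each analyzer receives the same source text and returns its own note, which matches the "same source, different perspective" shape of the `agent_postits` rows.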
Agent #2 · Root Cause: technical lens
perspective: root-cause
casefile: chatbot-hallucinated-policy
importance: 5 (critical)
body:
  Technical root cause: LLM hallucination on under-trained policy domain. System prompt did not include the actual refund policy text. Retrieval layer was absent; the chatbot answered from training-data priors rather than authoritative sources. No grounding citations enforced. No confidence threshold applied. Output post-processing did not detect the policy claim for verification.
  Contributing factors: deployment shipped without policy-domain test suite; no canary on hallucination rate; no human-in-the-loop on refund claims.
source_row_ids: [incidents:INC-2026-0142]
Agent #3 · Threat Modeling: adversarial lens
perspective: threat-model
casefile: chatbot-hallucinated-policy
importance: 4 (high)
body:
  Attack/abuse vectors enabled by this failure pattern:
  - Prompt-injection: user could engineer the chatbot to promise more favourable terms (jailbreak via emotional framing observed in this case even without malicious intent).
  - Adversarial extraction: attacker discovers chatbot accepts policy fabrications → uses transcripts as leverage in disputes.
  - Reputational denial-of-service: bad-faith actors flood support with edge cases hoping chatbot fabricates costly promises.
  Damage vectors: direct financial loss per honored fabrication + reputational cascade once one ruling becomes public.
source_row_ids: [incidents:INC-2026-0142]
Agent #4 · Guardrail Designer: prevention lens
perspective: guardrail
casefile: chatbot-hallucinated-policy
importance: 5 (critical)
body:
  Preventive controls:
  1. Retrieval-grounded responses: every policy claim must cite an authoritative document; no claim without citation is allowed to ship.
  2. Domain-specific output filter: regex + classifier to detect refund/policy statements; route to human review.
  3. Confidence threshold: refuse to commit to policy details below 0.85 model confidence on a calibrated test set.
  Detection controls:
  4. Hallucination canary: daily sample of 100 chatbot responses scored by ground-truth checker; alert if rate >2%.
  5. Customer-facing disclaimer: every policy statement carries "verify with a human agent for binding details" suffix.
  Blocking controls:
  6. Hard block on refund-amount commitments above $X without human approval.
source_row_ids: [incidents:INC-2026-0142]
Agent #5 · Dev Process: SDLC lens
perspective: dev-process
casefile: chatbot-hallucinated-policy
importance: 3 (medium)
body:
  Where guardrails insert into the SDLC:
  - Pre-deploy: policy-domain test suite (1000+ Q&A pairs from authoritative docs) must pass at 95% accuracy.
  - CI gate: hallucination rate on canary set must stay <2% rolling.
  - Pre-release: red-team simulation including prompt-injection attempts on refund/policy questions.
  - Post-deploy: weekly review of customer transcripts flagged by detection filter.
  Required role additions: policy SME embedded in pre-deploy reviews; on-call rotation for hallucination-canary alerts.
source_row_ids: [incidents:INC-2026-0142]
Agent #9 · Critic: gap-finding lens
perspective: critic
casefile: chatbot-hallucinated-policy
importance: 4 (high)
body:
  What the prior four analyses missed:
  - Legal layer: chatbot output as binding contract is a jurisdictional risk needing legal review, not just technical guardrails.
  - Contractual layer: customer ToS may need updating to clarify chatbot statements are advisory, not binding. (Legal must decide whether such clauses survive scrutiny; see Mata-v-Avianca pattern.)
  - Organizational layer: who owns the chatbot? If product owns it but support handles consequences, there's a misaligned-incentive gap.
  - Audit layer: how do we evidence "we did our diligence" after an incident? The Guardrail Designer's controls need an audit trail showing they ran.
  Recommendation: cross-functional governance loop, not just engineering controls.
source_row_ids: [incidents:INC-2026-0142]
Five rows in agent_postits. Same source. Five readings. Complementary, not contradictory: each lens surfaces what the others miss.
Cost: Sonnet × 4 + Opus × 1 (Critic), ~$0.12 total. Time: ~4 minutes (parallel).
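To make one of the proposed controls concrete, here is a minimal sketch of the hallucination canary from the Guardrail Designer's post-it (daily sample of 100 responses, alert if the rate exceeds 2%). The function names are hypothetical, and the ground-truth checker itself is out of scope; the sketch assumes it has already produced one boolean per sampled response.

```python
ALERT_THRESHOLD = 0.02  # "alert if rate >2%" from the post-it


def hallucination_rate(flags: list[bool]) -> float:
    """Fraction of sampled responses the checker flagged as hallucinated."""
    return sum(flags) / len(flags) if flags else 0.0


def should_alert(flags: list[bool], threshold: float = ALERT_THRESHOLD) -> bool:
    # Strictly greater than the threshold, matching the ">2%" wording.
    return hallucination_rate(flags) > threshold


# Illustrative daily sample: 100 responses, 3 flagged by the checker.
sample = [True] * 3 + [False] * 97
print(hallucination_rate(sample), should_alert(sample))
```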
T+10 minutes (09:10:00) — Stage 4: Evidence & Audit clusters
Agent #8. Watches agent_postits for new rows clustering around a casefile. When five new post-its land within a 10-minute window, it fires.
- Acquires a per-casefile lock
- Reads all 5 new post-its + the existing story.md for chatbot-hallucinated-policy
- Reads related historical incidents (other chatbot-hallucination patterns)
- Rewrites the casefile story:
~/library/stories/chatbot-hallucinated-policy/story.md
Excerpt:
# Pattern: AI chatbot fabricates binding policy commitments
**Pattern type:** LLM hallucination · customer-facing · contract-binding outcome.
**Severity baseline:** 4/5 (direct financial + reputational).
**Recurrence:** 3rd related incident in 18 months (precedents: INC-2025-0089, INC-2025-0144).
## What we now know
- Without retrieval grounding, LLM customer-support agents hallucinate policies with non-trivial frequency.
- Courts have started treating chatbot outputs as binding on the deployer.
- Engineering-only controls are insufficient; legal + contractual + organizational controls also required.
## Required controls (from this incident's analyses)
- [Root Cause] Retrieval grounding + confidence threshold + post-process filter
- [Threat Model] Prompt-injection hardening + abuse-pattern monitoring
- [Guardrail] 6 layered controls (preventive · detection · blocking)
- [Dev Process] Pre-deploy test suite · CI canary · red-team · embed policy SME
- [Critic] Cross-functional governance loop including legal + product + support
## Citations
- post-its: [postit:88432, postit:88433, postit:88434, postit:88435, postit:88436]
- source: incidents:INC-2026-0142
- related: incidents:INC-2025-0089, incidents:INC-2025-0144
Story archived to ~/library/stories/chatbot-hallucinated-policy/archive/2026-05-19T09-09-50.md (the previous version, in case anyone needs to audit when the analysis changed).
Cost: Sonnet, ~$0.02. Time: ~2 minutes.
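The trigger for this stage (five new post-its within a 10-minute window, then a per-casefile lock before rewriting) could be sketched as below. An in-process `threading.Lock` per casefile is an assumption chosen to keep the example self-contained; the real system may well lock in the database instead, and the function names are hypothetical.

```python
import threading
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Callable

WINDOW = timedelta(minutes=10)
REQUIRED_POSTITS = 5

# One lock per casefile so two rewrites of the same story cannot interleave.
_casefile_locks: dict = defaultdict(threading.Lock)


def should_fire(postit_times: list[datetime], now: datetime) -> bool:
    """True when enough post-its landed inside the trailing window."""
    recent = [t for t in postit_times if timedelta(0) <= now - t <= WINDOW]
    return len(recent) >= REQUIRED_POSTITS


def maybe_rewrite(casefile: str, postit_times: list[datetime], now: datetime,
                  rewrite: Callable[[str], None]) -> bool:
    if not should_fire(postit_times, now):
        return False
    with _casefile_locks[casefile]:
        rewrite(casefile)
    return True


# Illustrative arrival times: one post-it per minute from 09:01 to 09:05.
times = [datetime(2026, 5, 19, 9, m) for m in range(1, 6)]
rewritten = []
fired = maybe_rewrite("chatbot-hallucinated-policy", times,
                      datetime(2026, 5, 19, 9, 10), rewritten.append)
print(fired, rewritten)
```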
T+next-day 06:00 (Wednesday 06:00:00) — Stage 5: Person-watcher writes the brief
A daily synthesis agent runs at 06:00. It reads:
- Every active casefile story (this morning, including the new one)
- Recent pinned facts
- Open action items in user_tasks
- Yesterday’s leave-notes from each Guardrail Lab agent
It writes one document — ~/library/final/daily-ai-safety-brief.md — capturing everything the office knows this morning.
Excerpt for this incident’s section:
## NEW · Chatbot hallucination as binding contract — 3rd recurrence in 18 months
Pattern recurred yesterday (INC-2026-0142, severity 4). Standard control set required: retrieval grounding · confidence threshold · output filter · canary · human-on-refunds · prompt-injection hardening · cross-functional governance.
**Open action**: organizations deploying customer-support LLM should ship the control bundle before next deployment review. Audit trail required.
**Related**: INC-2025-0089 (same pattern), INC-2025-0144 (same pattern). Trend.
**Critic flag**: recurrence rate suggests industry-wide gap, not single-org issue. Recommend escalation to AI safety governance body.
This document is signed + hashed. The hash lives in the DB so cold-start can verify nothing tampered with the file.
Cost: Opus, ~$0.30 (daily, amortised across all active casefiles). Time: ~5 minutes.
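The tamper check on the brief can be illustrated with a short hashing sketch. SHA-256 is an assumption (the text does not name the hash algorithm), the signing step is not covered, and the stored hash is held in a plain variable here rather than read from the DB.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of the file's exact bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_brief(path: Path, stored_hash: str) -> bool:
    # compare_digest avoids leaking where the two digests first differ.
    return hmac.compare_digest(sha256_of(path), stored_hash)


with tempfile.TemporaryDirectory() as tmp:
    brief = Path(tmp) / "daily-ai-safety-brief.md"
    brief.write_text("## NEW · Chatbot hallucination as binding contract\n", encoding="utf-8")
    stored = sha256_of(brief)          # what would be saved at write time
    ok_before = verify_brief(brief, stored)
    brief.write_text("edited after the fact\n", encoding="utf-8")
    ok_after = verify_brief(brief, stored)

print(ok_before, ok_after)
```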
T+next-day 09:00 (Wednesday 09:00:00) — The engineer wakes up
An NBS engineer opens Claude Code at their workstation. The Guardrail Lab agent — a single user-facing agent the engineer interacts with — wakes up. SessionStart hook fires.
The hook reads daily-ai-safety-brief.md and injects ~10K tokens of context into the agent’s system prompt — the agent already knows everything before the engineer says a word.
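A rough sketch of the context-building step follows. It trims the brief to an approximate token budget before handing it over. The 4-characters-per-token ratio is a crude heuristic, not a real tokenizer, and writing the result to stdout for the hook to pick up is an assumption; the text does not say how the hook passes context to the agent.

```python
from pathlib import Path

TOKEN_BUDGET = 10_000   # "~10K tokens" from the text
CHARS_PER_TOKEN = 4     # rough heuristic, not a real tokenizer


def build_context(brief_text: str, token_budget: int = TOKEN_BUDGET) -> str:
    """Return the brief, cut at a paragraph boundary if it exceeds the budget."""
    max_chars = token_budget * CHARS_PER_TOKEN
    if len(brief_text) <= max_chars:
        return brief_text
    cut = brief_text.rfind("\n\n", 0, max_chars)
    return brief_text[: cut if cut > 0 else max_chars]


def emit_context(brief_path: Path) -> None:
    """Hypothetical hook entry point: print the trimmed brief to stdout."""
    print(build_context(brief_path.read_text(encoding="utf-8")))


# Small demonstration with a tiny budget so the cut is visible.
demo = "x" * 30 + "\n\n" + "y" * 30
print(len(build_context(demo, token_budget=10)))
```

Cutting at a paragraph boundary keeps whole sections of the brief intact rather than ending mid-sentence.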
The engineer types “summarise yesterday’s incidents and prep a control checklist.” The agent doesn’t ask “which incidents?” or “what’s the context?” It just produces:
- A two-paragraph summary of INC-2026-0142
- The recurrence pattern flag (3rd in 18 months)
- A ready-to-circulate control checklist (Markdown, formatted, copy-paste)
- A draft escalation note for AI safety governance
All cited back through the chain to raw incident bytes.
Why this is the demonstration that matters
Four things this trace proves that nothing else does:
| Claim | How this trace proves it |
|---|---|
| The office reads through perspectives | Five post-its, five lenses, one source. Root Cause sees mechanism; Threat Model sees abuse vectors; Guardrail Designer sees controls; Dev Process sees SDLC integration; Critic sees what the others miss. None contradicts. |
| Stories cluster, not pile | Evidence & Audit takes five post-its and writes one coherent narrative connecting this incident to prior recurrences. Not a dump — a synthesis. |
| Cold-start is free for the engineer | The agent wakes up knowing yesterday’s incident, the pattern, the controls. No “explain the context” prompt. No “what’s the situation” handshake. |
| Provenance is real | Every claim cites a post-it which cites a source row which cites raw JSONL bytes. Anyone asking “where did you get this?” gets a verifiable chain back to ground truth. |
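The provenance chain in the last row can be pictured as a simple lookup walk: post-it, then source row, then raw file. The dictionaries below are in-memory stand-ins for the `agent_postits` and `incidents` tables, and the `raw_path` field name is an assumption.

```python
# Illustrative stand-ins for the agent_postits and incidents tables.
POSTITS = {
    "postit:88432": {"source_row_ids": ["incidents:INC-2026-0142"]},
}
SOURCE_ROWS = {
    "incidents:INC-2026-0142": {
        "raw_path": "~/incidents/raw/2026-05-19_INC-2026-0142.jsonl",
    },
}


def trace(postit_id: str) -> list[list[str]]:
    """Return one [post-it, source row, raw file] chain per cited source row."""
    chains = []
    for row_id in POSTITS[postit_id]["source_row_ids"]:
        raw_path = SOURCE_ROWS[row_id]["raw_path"]  # KeyError means a broken chain
        chains.append([postit_id, row_id, raw_path])
    return chains


for chain in trace("postit:88432"):
    print(" -> ".join(chain))
```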
The cost end-to-end
For one incident through the full pipeline:
| Stage | Agent | Model | Cost | Time |
|---|---|---|---|---|
| 1. Incident Collector | #1 | Haiku | $0.0001 | 3s |
| 2. Converter | (built-in) | Haiku | $0.0005 | 10s |
| 3. 5 analyzers (parallel) | #2, #3, #4, #5, #9 | Sonnet × 4 + Opus × 1 | $0.12 | 4 min |
| 4. Evidence & Audit (Story-builder role) | #8 | Sonnet | $0.02 | 2 min |
| 5. Daily synthesis | (Person-watcher role) | Opus | ~$0.001 / incident | 5 min daily |
| Total per incident | | | ~$0.14 | ~6 min (mostly parallel) |
At 30 incidents / week, that’s ~$17/month for an entire AI safety incident-analysis office running continuously.
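The arithmetic behind these figures can be reproduced from the table. The sketch assumes a 4-week month, which is what makes the monthly total land at roughly $17.

```python
# Per-incident stage costs in USD, copied from the table above.
STAGE_COSTS = {
    "incident_collector": 0.0001,
    "converter": 0.0005,
    "five_analyzers": 0.12,
    "evidence_and_audit": 0.02,
    "daily_synthesis": 0.001,
}

INCIDENTS_PER_WEEK = 30
WEEKS_PER_MONTH = 4  # assumption; the text does not state the month length used

per_incident = sum(STAGE_COSTS.values())
monthly = per_incident * INCIDENTS_PER_WEEK * WEEKS_PER_MONTH

print(f"per incident: ${per_incident:.4f}")
print(f"per month:    ${monthly:.2f}")
```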
The 9 agents in this trace
| # | Agent | Role in this trace |
|---|---|---|
| 1 | Incident Collector | Pulled INC-2026-0142 from the feed |
| 2 | Root Cause | Wrote the technical-lens post-it |
| 3 | Threat Modeling | Wrote the adversarial-lens post-it |
| 4 | Guardrail Designer | Wrote the controls post-it |
| 5 | Dev Process | Wrote the SDLC-integration post-it |
| 6 | Policy-as-Code | (not invoked in this trace — engages when controls land in code) |
| 7 | Claude Hook | (not invoked — engages when SDK-specific guardrails needed) |
| 8 | Evidence & Audit | Clustered the post-its into a story with provenance |
| 9 | Critic | Wrote the gap-finding post-it (what the others missed) |
Agents #6 and #7 engage in the next phase — when the controls become code and CI checks. That extends the trace by ~30 minutes and produces deployable artifacts.