Worked example — one incident, nine agents, one report

A representative AI safety incident arrives at 09:00 Tuesday morning. By Wednesday 06:00 the AI Guardrail Lab has read it, analyzed it through five lenses, designed controls, captured evidence, and produced a ready-to-circulate report.

This page walks the trace end-to-end. It is the single best demonstration of what the 9-agent Guardrail Lab actually does.

The incident

Tuesday 09:00. Incident ID: INC-2026-0142. Source: OECD AI Incidents Monitor (AIM).

A pattern that recurs in real incident databases:

An AI-powered customer-support chatbot deployed by a mid-size airline confidently quoted a bereavement refund policy that did not exist. A grieving customer purchased an expensive last-minute ticket relying on the chatbot’s promise. The carrier later denied the refund. A small-claims court ruled the chatbot’s statement was binding on the carrier — refund must be honored. Public-facing reputational damage followed.

System type: LLM customer-support chatbot. Deployment context: post-purchase customer service. Harm type: misinformation · financial · reputational. Severity: 4/5.

The incident appears as one JSONL row in the dataset configured for the Guardrail Lab.

The trace

T+0 (09:00:00) — Incident lands

The Incident Collector watches a configured incidents feed (OECD AIM + AIAAIC + Stanford + Damien Charlotin tracker). A new row matching the configured filter (severity ≥ 3, public chatbot, last-72h) arrives.

~/incidents/raw/2026-05-19_INC-2026-0142.jsonl
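The feed filter can be pictured as a small predicate over each row. This is a minimal sketch, not the Lab's actual code: the function name `matches_filter` is hypothetical, the field names are borrowed from the row written in Stage 1, and matching "chatbot" in `system_type` is an assumed stand-in for the "public chatbot" condition.

```python
from datetime import datetime, timedelta

def matches_filter(row: dict, now: datetime, min_severity: int = 3,
                   max_age: timedelta = timedelta(hours=72)) -> bool:
    """Return True if the row passes the severity, system-type, and recency checks."""
    received = datetime.fromisoformat(row["received_at"])
    return (
        row["severity"] >= min_severity
        # Assumption: "public chatbot" is approximated by the system_type text.
        and "chatbot" in row["system_type"].lower()
        # Only rows received within the last 72 hours (and not in the future).
        and timedelta(0) <= now - received <= max_age
    )

row = {
    "id": "INC-2026-0142",
    "severity": 4,
    "system_type": "LLM chatbot",
    "received_at": "2026-05-19T09:00:00",
}
now = datetime(2026, 5, 19, 9, 0, 5)
print(matches_filter(row, now))
```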

Nothing else happens yet. The office is silent.

T+5 seconds (09:00:05) — Stage 1: Incident Collector fires

Agent #1 of the AI Guardrail Lab. Watching the feed, it picks up the new row.

  • Reads the JSONL record
  • Validates schema (required fields present)
  • Writes a row to the incidents table: id=INC-2026-0142, severity=4, system_type="LLM chatbot", deployment_context="customer support", harm=["misinformation","financial","reputational"], received_at=2026-05-19T09:00:00
  • Emits an event: incidents:INC-2026-0142 new

Cost: Haiku, ~$0.0001. Time: ~3s.
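The four steps above (read, validate, write, emit) could look roughly like the following. It is a sketch under stated assumptions: SQLite stands in for whatever database the Lab uses, the column layout mirrors the row shown above, and `ingest` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json
import sqlite3

# Assumption: these are the fields the schema check requires.
REQUIRED_FIELDS = {"id", "severity", "system_type", "deployment_context", "harm"}

def ingest(line: str, conn: sqlite3.Connection) -> str:
    """Validate one JSONL line, store it, and return the event string to emit."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    conn.execute(
        "INSERT INTO incidents "
        "(id, severity, system_type, deployment_context, harm, received_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            record["id"],
            record["severity"],
            record["system_type"],
            record["deployment_context"],
            json.dumps(record["harm"]),  # store the list as JSON text
            record.get("received_at"),
        ),
    )
    conn.commit()
    return f"incidents:{record['id']} new"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (id TEXT PRIMARY KEY, severity INTEGER, "
    "system_type TEXT, deployment_context TEXT, harm TEXT, received_at TEXT)"
)
line = json.dumps({
    "id": "INC-2026-0142",
    "severity": 4,
    "system_type": "LLM chatbot",
    "deployment_context": "customer support",
    "harm": ["misinformation", "financial", "reputational"],
    "received_at": "2026-05-19T09:00:00",
})
event = ingest(line, conn)
print(event)
```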

T+15 seconds (09:00:15) — Stage 2: Converter

The Converter normalises raw JSONL into an agent-readable .md sidecar.

~/library/readable/incidents/2026-05-19_INC-2026-0142.md

With frontmatter:

source_url: file:///incidents/raw/2026-05-19_INC-2026-0142.jsonl
source_row: incidents:INC-2026-0142
system_type: LLM chatbot
context: customer support
severity: 4
harm_types: [misinformation, financial, reputational]

The body becomes plain markdown describing the incident in 200-400 words — facts only, no analysis.

Cost: Haiku, ~$0.0005. Time: ~10s.
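Assembling the sidecar is mostly string formatting once the body text exists. The sketch below assumes the frontmatter is wrapped in `---` delimiters (the usual YAML convention, not confirmed by this page) and takes the body as an argument, since the model call that writes it is out of scope here. `to_sidecar` is a hypothetical name.

```python
def to_sidecar(record: dict, raw_path: str, body: str) -> str:
    """Render frontmatter plus a facts-only body as one markdown document."""
    harm = ", ".join(record["harm"])
    frontmatter = [
        f"source_url: file://{raw_path}",
        f"source_row: incidents:{record['id']}",
        f"system_type: {record['system_type']}",
        f"context: {record['deployment_context']}",
        f"severity: {record['severity']}",
        f"harm_types: [{harm}]",
    ]
    # Assumption: '---' delimiters surround the frontmatter block.
    return "---\n" + "\n".join(frontmatter) + "\n---\n\n" + body.strip() + "\n"

record = {
    "id": "INC-2026-0142",
    "severity": 4,
    "system_type": "LLM chatbot",
    "deployment_context": "customer support",
    "harm": ["misinformation", "financial", "reputational"],
}
sidecar = to_sidecar(
    record,
    "/incidents/raw/2026-05-19_INC-2026-0142.jsonl",
    "An airline chatbot quoted a refund policy that did not exist.",
)
print(sidecar)
```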

T+30 seconds → T+5 minutes (09:00:30 – 09:05:00) — Stage 3: five analyzers fan out

This is the most important stage. Five of the Guardrail Lab’s nine agents read the same .md through different lenses. They run in parallel.

Each writes a post-it — a structured note — to the agent_postits table. Same source_url, same casefile_anchor (chatbot-hallucinated-policy), different perspective.
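One way to picture the fan-out is a thread pool that runs one analyzer per perspective over the same text and collects one post-it from each. This is only a sketch: the `Postit` fields follow the examples on this page, while `analyze` is a placeholder for the model call and `fan_out` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Postit:
    perspective: str
    casefile: str
    importance: int
    body: str
    source_row_ids: list

PERSPECTIVES = ["root-cause", "threat-model", "guardrail", "dev-process", "critic"]

def analyze(perspective: str, incident_md: str, casefile: str, source_row: str) -> Postit:
    # Placeholder: a real analyzer would prompt a model with its own lens here.
    body = f"[{perspective}] notes on a {len(incident_md.split())}-word incident summary"
    return Postit(perspective, casefile, 3, body, [source_row])

def fan_out(incident_md: str, casefile: str, source_row: str) -> list:
    """Run all five lenses concurrently and return their post-its in a fixed order."""
    with ThreadPoolExecutor(max_workers=len(PERSPECTIVES)) as pool:
        futures = [
            pool.submit(analyze, p, incident_md, casefile, source_row)
            for p in PERSPECTIVES
        ]
        return [f.result() for f in futures]

postits = fan_out(
    "An airline chatbot quoted a refund policy that did not exist.",
    "chatbot-hallucinated-policy",
    "incidents:INC-2026-0142",
)
for p in postits:
    print(p.perspective, p.source_row_ids)
```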

Agent #2 · Root Cause — technical lens

perspective: root-cause
casefile: chatbot-hallucinated-policy
importance: 5 (critical)
body:
Technical root cause: LLM hallucination on under-trained policy domain.
System prompt did not include the actual refund policy text. Retrieval layer
was absent — chatbot answered from training-data priors rather than
authoritative sources. No grounding citations enforced. No confidence
threshold applied. Output post-processing did not detect the policy claim
for verification.
Contributing factors: deployment shipped without policy-domain test suite;
no canary on hallucination rate; no human-in-the-loop on refund claims.
source_row_ids: [incidents:INC-2026-0142]

Agent #3 · Threat Modeling — adversarial lens

perspective: threat-model
casefile: chatbot-hallucinated-policy
importance: 4 (high)
body:
Attack/abuse vectors enabled by this failure pattern:
- Prompt-injection: user could engineer the chatbot to promise more
favourable terms (jailbreak via emotional framing observed in this case
even without malicious intent).
- Adversarial extraction: attacker discovers chatbot accepts policy
fabrications → uses transcripts as leverage in disputes.
- Reputational denial-of-service: bad-faith actors flood support with
edge cases hoping chatbot fabricates costly promises.
Damage vectors: direct financial loss per honored fabrication +
reputational cascade once one ruling becomes public.
source_row_ids: [incidents:INC-2026-0142]

Agent #4 · Guardrail Designer — prevention lens

perspective: guardrail
casefile: chatbot-hallucinated-policy
importance: 5 (critical)
body:
Preventive controls:
1. Retrieval-grounded responses — every policy claim must cite an
authoritative document; no claim without citation is allowed to ship.
2. Domain-specific output filter — regex + classifier to detect
refund/policy statements; route to human review.
3. Confidence threshold — refuse to commit to policy details below 0.85
model confidence on a calibrated test set.
Detection controls:
4. Hallucination canary — daily sample of 100 chatbot responses scored by
ground-truth checker; alert if rate >2%.
5. Customer-facing disclaimer — every policy statement carries "verify with
a human agent for binding details" suffix.
Blocking controls:
6. Hard block on refund-amount commitments above $X without human approval.
source_row_ids: [incidents:INC-2026-0142]
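As an illustration of control 2, the regex half of the output filter might look like this. The pattern list and the `route_response` name are assumptions made for the sketch, and the classifier half is omitted entirely.

```python
import re

# Assumption: a small keyword pattern stands in for the real detection rules.
POLICY_PATTERN = re.compile(
    r"\b(refund\w*|reimburse\w*|bereavement|policy|policies|compensation)\b",
    re.IGNORECASE,
)

def route_response(text: str) -> str:
    """Return 'human_review' for drafts that mention refunds or policy, else 'send'."""
    if POLICY_PATTERN.search(text):
        return "human_review"
    return "send"

print(route_response("You can claim a bereavement refund within 90 days."))
print(route_response("Your flight departs at 10:40 from gate 12."))
```

A keyword match of this kind over-triggers by design; in the post-it's framing, the classifier and the human reviewer are what separate real policy commitments from harmless mentions.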

Agent #5 · Dev Process — SDLC lens

perspective: dev-process
casefile: chatbot-hallucinated-policy
importance: 3 (medium)
body:
Where guardrails insert into the SDLC:
- Pre-deploy: policy-domain test suite (1000+ Q&A pairs from authoritative
docs) must pass at 95% accuracy.
- CI gate: hallucination rate on canary set must stay <2% rolling.
- Pre-release: red-team simulation including prompt-injection attempts on
refund/policy questions.
- Post-deploy: weekly review of customer transcripts flagged by detection
filter.
Required role additions: policy SME embedded in pre-deploy reviews;
on-call rotation for hallucination-canary alerts.
source_row_ids: [incidents:INC-2026-0142]

Agent #9 · Critic — gap-finding lens

perspective: critic
casefile: chatbot-hallucinated-policy
importance: 4 (high)
body:
What the prior four analyses missed:
- Legal layer: chatbot output as binding contract is a jurisdictional risk
needing legal review, not just technical guardrails.
- Contractual layer: customer ToS may need updating to clarify chatbot
statements are advisory, not binding. (Legal must decide whether such
clauses survive scrutiny; see the Moffatt v. Air Canada pattern.)
- Organizational layer: who owns the chatbot? If product owns it but
support handles consequences, there's a misaligned-incentive gap.
- Audit layer: how do we evidence "we did our diligence" after an incident?
The Guardrail Designer's controls need an audit trail showing they ran.
Recommendation: cross-functional governance loop, not just engineering controls.
source_row_ids: [incidents:INC-2026-0142]

Five rows in agent_postits. Same source. Five readings. Complementary, not contradictory — each lens surfaces what the others miss.

Cost: Sonnet × 4 + Opus × 1 (Critic), ~$0.12 total. Time: ~4 minutes (parallel).

T+10 minutes (09:10:00) — Stage 4: Evidence & Audit clusters

Agent #8. Watches agent_postits for new rows clustering around a casefile. When five new post-its land within a 10-minute window, it fires.

  • Acquires a per-casefile lock
  • Reads all 5 new post-its + the existing story.md for chatbot-hallucinated-policy
  • Reads related historical incidents (other chatbot-hallucination patterns)
  • Rewrites the casefile story:
~/library/stories/chatbot-hallucinated-policy/story.md

Excerpt:

# Pattern: AI chatbot fabricates binding policy commitments
**Pattern type:** LLM hallucination · customer-facing · contract-binding outcome.
**Severity baseline:** 4/5 (direct financial + reputational).
**Recurrence:** 3rd related incident in 18 months (precedents: INC-2025-0089, INC-2025-0144).
## What we now know
- Without retrieval grounding, LLM customer-support agents hallucinate policies
with non-trivial frequency.
- Courts have started treating chatbot outputs as binding on the deployer.
- Engineering-only controls are insufficient; legal + contractual + organizational
controls also required.
## Required controls (from this incident's analyses)
- [Root Cause] Retrieval grounding + confidence threshold + post-process filter
- [Threat Model] Prompt-injection hardening + abuse-pattern monitoring
- [Guardrail] 6 layered controls (preventive · detection · blocking)
- [Dev Process] Pre-deploy test suite · CI canary · red-team · embed policy SME
- [Critic] Cross-functional governance loop including legal + product + support
## Citations
- post-its: [postit:88432, postit:88433, postit:88434, postit:88435, postit:88436]
- source: incidents:INC-2026-0142
- related: incidents:INC-2025-0089, incidents:INC-2025-0144

The previous version of the story is archived to ~/library/stories/chatbot-hallucinated-policy/archive/2026-05-19T09-09-50.md, so anyone can audit when the analysis changed.

Cost: Sonnet, ~$0.02. Time: ~2 minutes.
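The trigger described at the start of this stage (five new post-its for one casefile within a 10-minute window) reduces to a count over recent timestamps. A minimal sketch, with `should_fire` as a hypothetical name and the per-casefile lock left out:

```python
from datetime import datetime, timedelta

def should_fire(timestamps: list, now: datetime, threshold: int = 5,
                window: timedelta = timedelta(minutes=10)) -> bool:
    """True when at least `threshold` post-its landed within `window` of `now`."""
    recent = [t for t in timestamps if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold

# Five post-its for the same casefile, written between 09:01 and 09:05.
landed = [datetime(2026, 5, 19, 9, minute) for minute in range(1, 6)]
print(should_fire(landed, datetime(2026, 5, 19, 9, 10)))
print(should_fire(landed[:4], datetime(2026, 5, 19, 9, 10)))
```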

T+next-day 06:00 (Wednesday 06:00:00) — Stage 5: Person-watcher writes the brief

A daily synthesis agent runs at 06:00. It reads:

  • Every active casefile story (this morning, including the new one)
  • Recent pinned facts
  • Open action items in user_tasks
  • Yesterday’s leave-notes from each Guardrail Lab agent

It writes one document — ~/library/final/daily-ai-safety-brief.md — capturing everything the office knows this morning.

Excerpt for this incident’s section:

## NEW · Chatbot hallucination as binding contract — 3rd recurrence in 18 months
Pattern recurred yesterday (INC-2026-0142, severity 4). Standard control set
required: retrieval grounding · confidence threshold · output filter · canary ·
human-on-refunds · prompt-injection hardening · cross-functional governance.
**Open action**: organizations deploying customer-support LLM should ship the
control bundle before next deployment review. Audit trail required.
**Related**: INC-2025-0089 (same pattern), INC-2025-0144 (same pattern). Trend.
**Critic flag**: recurrence rate suggests industry-wide gap, not single-org issue.
Recommend escalation to AI safety governance body.

This document is signed and hashed. The hash is stored in the DB so that a cold start can verify the file has not been tampered with.
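The hash check could be as simple as the following. SHA-256 is an assumption (the page does not name the algorithm), the signing step is not shown, and `sha256_hex` and `verify` are hypothetical helper names.

```python
import hashlib
import hmac

def sha256_hex(data: bytes) -> str:
    """Hex digest of the file contents; this is the value stored in the DB."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, stored_hash: str) -> bool:
    """Recompute the digest at cold start and compare it with the stored one."""
    return hmac.compare_digest(sha256_hex(data), stored_hash)

brief = b"## NEW: Chatbot hallucination as binding contract\n"
stored = sha256_hex(brief)            # written to the DB when the brief is produced
print(verify(brief, stored))          # unchanged file
print(verify(brief + b"x", stored))   # file modified after hashing
```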

Cost: Opus, ~$0.30 (daily, amortised across all active casefiles). Time: ~5 minutes.

T+next-day 09:00 (Wednesday 09:00:00) — The engineer wakes up

An NBS engineer opens Claude Code at their workstation. The Guardrail Lab agent (the single user-facing agent the engineer interacts with) wakes up, and the SessionStart hook fires.

The hook reads daily-ai-safety-brief.md and injects ~10K tokens of context into the agent’s system prompt — the agent already knows everything before the engineer says a word.
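The part of the hook that prepares the context can be sketched as a trim-to-budget step. Everything here is an assumption for illustration: the `build_session_context` name, the rule of thumb of about four characters per token (so roughly 40,000 characters for 10K tokens), and cutting at a line boundary. How the returned text reaches the agent depends on the hook mechanism and is not shown.

```python
def build_session_context(brief: str, max_chars: int = 40_000) -> str:
    """Return the brief unchanged if it fits, else cut it at the last full line."""
    if len(brief) <= max_chars:
        return brief
    cut = brief.rfind("\n", 0, max_chars)
    # Fall back to a hard cut when no newline falls inside the budget.
    return brief[: cut if cut > 0 else max_chars]

sample = "## NEW\nPattern recurred yesterday.\n**Open action**: ship the control bundle.\n"
print(build_session_context(sample))
print(build_session_context(sample, max_chars=20))
```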

The engineer types “summarise yesterday’s incidents and prep a control checklist.” The agent doesn’t ask “which incidents?” or “what’s the context?” It just produces:

  • A two-paragraph summary of INC-2026-0142
  • The recurrence pattern flag (3rd in 18 months)
  • A ready-to-circulate control checklist (Markdown, formatted, copy-paste)
  • A draft escalation note for AI safety governance

All cited back through the chain to raw incident bytes.

Why this is the demonstration that matters

Four things this trace proves that nothing else does:

| Claim | How this trace proves it |
| --- | --- |
| The office reads through perspectives | Five post-its, five lenses, one source. Root Cause sees mechanism; Threat Model sees abuse vectors; Guardrail Designer sees controls; Dev Process sees SDLC integration; Critic sees what the others miss. None contradicts. |
| Stories cluster, not pile | Evidence & Audit takes five post-its and writes one coherent narrative connecting this incident to prior recurrences. Not a dump but a synthesis. |
| Cold-start is free for the engineer | The agent wakes up knowing yesterday’s incident, the pattern, the controls. No “explain the context” prompt. No “what’s the situation” handshake. |
| Provenance is real | Every claim cites a post-it which cites a source row which cites raw JSONL bytes. Anyone asking “where did you get this?” gets a verifiable chain back to ground truth. |
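The provenance chain can be walked mechanically. In this sketch, plain dictionaries stand in for the agent_postits and incidents tables, `trace` is a hypothetical name, and the `raw_path` field is an assumption; the IDs are the ones cited in the story excerpt.

```python
def trace(postit_id: str, postits: dict, incidents: dict) -> list:
    """Follow one cited post-it back to the source row and raw file behind it."""
    chain = [postit_id]
    for row_id in postits[postit_id]["source_row_ids"]:
        chain.append(row_id)
        chain.append(incidents[row_id]["raw_path"])
    return chain

postits = {"postit:88432": {"source_row_ids": ["incidents:INC-2026-0142"]}}
incidents = {
    "incidents:INC-2026-0142": {
        "raw_path": "~/incidents/raw/2026-05-19_INC-2026-0142.jsonl"
    }
}
chain = trace("postit:88432", postits, incidents)
print(" -> ".join(chain))
```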

The cost end-to-end

For one incident through the full pipeline:

| Stage | Agent | Model | Cost | Time |
| --- | --- | --- | --- | --- |
| 1. Incident Collector | #1 | Haiku | $0.0001 | 3s |
| 2. Converter | (built-in) | Haiku | $0.0005 | 10s |
| 3. 5 analyzers (parallel) | #2, #3, #4, #5, #9 | Sonnet × 4 + Opus × 1 | $0.12 | 4 min |
| 4. Evidence & Audit (Story-builder role) | #8 | Sonnet | $0.02 | 2 min |
| 5. Daily synthesis | (Person-watcher role) | Opus | ~$0.001 / incident | 5 min daily |
| Total per incident | | | ~$0.14 | ~6 min (mostly parallel) |

At 30 incidents / week, that’s about $4.25 per week, or roughly $17 to $18 per month, for an entire AI safety incident-analysis office running continuously.
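The arithmetic behind the monthly figure, using the per-stage costs from the table, is short enough to check directly. The only assumption is how a month is counted: four weeks gives about $17, and an average calendar month (52 weeks over 12 months) gives about $18.

```python
# Per-incident cost of each stage, taken from the table above (USD).
stage_costs = {
    "incident_collector": 0.0001,
    "converter": 0.0005,
    "five_analyzers": 0.12,
    "evidence_and_audit": 0.02,
    "daily_synthesis_share": 0.001,
}

per_incident = sum(stage_costs.values())
weekly = per_incident * 30        # 30 incidents per week
monthly_4wk = weekly * 4          # four-week month
monthly_avg = weekly * 52 / 12    # average calendar month

print(f"per incident: ${per_incident:.4f}")
print(f"per week:     ${weekly:.2f}")
print(f"per month:    ${monthly_4wk:.2f} (4 weeks) / ${monthly_avg:.2f} (calendar average)")
```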

The 9 agents in this trace

| # | Agent | Role in this trace |
| --- | --- | --- |
| 1 | Incident Collector | Pulled INC-2026-0142 from the feed |
| 2 | Root Cause | Wrote the technical-lens post-it |
| 3 | Threat Modeling | Wrote the adversarial-lens post-it |
| 4 | Guardrail Designer | Wrote the controls post-it |
| 5 | Dev Process | Wrote the SDLC-integration post-it |
| 6 | Policy-as-Code | (not invoked in this trace; engages when controls land in code) |
| 7 | Claude Hook | (not invoked; engages when SDK-specific guardrails are needed) |
| 8 | Evidence & Audit | Clustered the post-its into a story with provenance |
| 9 | Critic | Wrote the gap-finding post-it (what the others missed) |

Agents #6 and #7 engage in the next phase — when the controls become code and CI checks. That extends the trace by ~30 minutes and produces deployable artifacts.