Ch 20: Failures, Retries, Recovery, and Observability

The Control Room

Every complex operation needs a control room. Power plants have them. Air traffic control has them. Mission control at NASA has them. The control room is where you go when you need to answer one question: what is happening right now?

Your system is no different. You have a team leader breaking work into pieces, team members carrying out tasks in their own workspaces, finished work flowing through review, specialists joining and leaving, and scheduled routines running on their own. Dozens of moving parts, all running at the same time.

When everything works, you do not need the control room. But not everything always works. A team member goes silent. A finished piece of work disappears. Tasks take ten times longer than yesterday. Without a control room, you are blind: left guessing as you dig through thousands of scribbled notes.

Observability is the control room. It has three instruments: the event timeline (what happened, in what order), the dashboard dials and gauges (how fast, how often, how much), and the warning lights and alarms (when something crosses a line). Together, they make the invisible visible. They turn "something is wrong" into "this is why, and here is how to fix it."

The 7 Things That Go Wrong

Every complex operation — no matter how well designed — will run into these seven problems. They are not one-time mistakes you fix and forget. They are forces of nature that your system must handle gracefully, every time, forever.

  1. Someone took too long to respond. You asked a team member to do something, and they went silent. The clock ran out. Everything waiting on their answer piles up.

  2. Too many requests at once. You sent too many messages too fast. The other side says "slow down" and stops answering. If you keep pushing, they shut the door entirely.

  3. The reply comes back scrambled. You asked for a clear answer, but what came back is garbled or cut off halfway through. The system cannot make sense of it.

  4. An errand fails. You sent someone to fetch a file, run a task, or look something up — and it did not work. The file was missing. The task broke. The record room was locked. Now someone needs to decide: try again, skip it, or ask for help.

  5. The notepad is full. The helper has been working so long that their notepad is completely full. They start forgetting earlier instructions. They lose track of what they were doing.

  6. Going in circles. The helper keeps doing the same thing over and over, getting the same result, and trying again. The safety guardrail catches it — but only if you set one up.

  7. Locked door. The helper tries to open a door they do not have the key for — a protected area, a restricted resource. The safety gate blocks them, but the helper needs to understand why and find another way.

Each of these problems is recoverable. The difference between a hobby project and a real operation is not whether failures happen — it is whether the system handles them on its own, records what went wrong, and sounds the alarm for the right people.
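One way to picture "handles them on its own, records what went wrong, and sounds the alarm" is the sketch below. It is a minimal illustration, not a prescribed implementation: the names Failure, TaskError, RETRYABLE, and run_with_recovery are hypothetical, and the choice of which problems are worth retrying is an assumption you would tune for your own system.

```python
import time
from enum import Enum


class Failure(Enum):
    """The seven problems from the list above (labels are hypothetical)."""
    TIMEOUT = "timeout"                  # 1. took too long to respond
    RATE_LIMITED = "rate_limited"        # 2. too many requests at once
    MALFORMED_REPLY = "malformed_reply"  # 3. reply came back scrambled
    TOOL_ERROR = "tool_error"            # 4. an errand failed
    CONTEXT_FULL = "context_full"        # 5. the notepad is full
    LOOP_DETECTED = "loop_detected"      # 6. going in circles
    PERMISSION_DENIED = "denied"         # 7. locked door


# Assumption: these are worth trying again unchanged; the rest need a different plan.
RETRYABLE = {
    Failure.TIMEOUT,
    Failure.RATE_LIMITED,
    Failure.MALFORMED_REPLY,
    Failure.TOOL_ERROR,
}


class TaskError(Exception):
    """An error that carries which of the seven problems occurred."""

    def __init__(self, kind, detail=""):
        super().__init__(f"{kind.value}: {detail}")
        self.kind = kind


def run_with_recovery(task, max_attempts=4, base_delay=0.5, sleep=time.sleep, log=None):
    """Run task(), retrying retryable failures with exponential backoff.

    Every failure is appended to `log` so it is recorded, never swallowed.
    Non-retryable failures, or running out of attempts, re-raise the error
    so that someone (or something) else can be alerted.
    """
    log = log if log is not None else []
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TaskError as err:
            log.append({"attempt": attempt, "kind": err.kind.value})
            if err.kind not in RETRYABLE or attempt == max_attempts:
                raise  # escalate: sound the alarm for the right people
            # Wait 0.5s, 1s, 2s, ... so a busy helper gets room to recover.
            sleep(base_delay * 2 ** (attempt - 1))
```

In this sketch a full notepad, a loop, or a locked door is not retried, because repeating the same request would likely fail the same way. Those errors are raised immediately so that a handoff, a guardrail, or a person can choose a different path.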

Narrator

You are the night-shift supervisor in the control room. It is 2 AM. A warning light starts flashing: the task completion rate has dropped from 95% to 12% in the last 15 minutes. Time to figure out what is going on and fix it.

Put the incident response steps in the correct order

  1. Spot the problem on the dashboard or hear the alarm
  2. Read the event timeline to find the root cause
  3. Apply the fix
  4. Watch the dials to confirm things are back to normal
  5. Add new warning lights to catch it earlier next time

Key Insight

A control room is not just a pile of notebooks. Notebooks are "someone wrote down what happened." A control room is "I can ask any question about what is going on — even questions I never thought to ask ahead of time."

A notebook entry might say: "Phone line timed out." That is a fact. But it does not tell you:

  • How many timeouts happened in the last hour?
  • Are they all coming from one team member, or spread across everyone?
  • Did they start after the latest change went live?
  • Does the timeout rate go up when the system is busier?

A real control room answers all of these questions because it captures organized information — the story of each event, dashboard readings with labels, and records with timestamps — that can be sorted, filtered, and combined after the fact.
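To make "organized information" concrete, here is a small sketch of how labeled, timestamped records can answer the first three questions above. The event fields (ts, worker, kind, version) and the helper names are assumptions made up for this example, and the sample data is invented purely for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical structured events: each one is a record with labels,
# not a free-text note like "Phone line timed out."
events = [
    {"ts": datetime(2024, 1, 1, 0, 30), "worker": "w1", "kind": "timeout", "version": "v1"},
    {"ts": datetime(2024, 1, 1, 1, 10), "worker": "w1", "kind": "timeout", "version": "v1"},
    {"ts": datetime(2024, 1, 1, 1, 40), "worker": "w2", "kind": "ok", "version": "v2"},
    {"ts": datetime(2024, 1, 1, 1, 45), "worker": "w2", "kind": "timeout", "version": "v2"},
    {"ts": datetime(2024, 1, 1, 1, 50), "worker": "w2", "kind": "timeout", "version": "v2"},
]


def timeouts_since(events, now, window):
    """Return timeout events that fall within `window` before `now`."""
    return [
        e for e in events
        if e["kind"] == "timeout" and now - window <= e["ts"] <= now
    ]


def count_by(events, label):
    """Group events by one label, such as 'worker' or 'version'."""
    return Counter(e[label] for e in events)


now = datetime(2024, 1, 1, 2, 0)
recent = timeouts_since(events, now, timedelta(hours=1))

print(len(recent))                  # how many timeouts in the last hour?
print(count_by(recent, "worker"))   # one team member, or everyone?
print(count_by(recent, "version"))  # did they start after the latest change?
```

None of these questions had to be planned when the events were written down. Because every record carries the same labels, a new question is just a new filter or a new grouping.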

The three instruments work together:

  • The event timeline tells you the story of a single task: what happened, in what order, and how long each step took.
  • The dashboard dials and gauges tell you the big picture: how the system is performing overall, right now and over time.
  • The warning lights and alarms tell you when something crosses a line that needs attention.

A system with great notebooks but no dashboard is like having a library with no catalogue. The information is in there — but finding it takes hours. A system with a great dashboard but no event timeline gives you the "what" but not the "why." A system with all three gives you the full control room: you see the problem on the dashboard, read the timeline to understand it, and fix the root cause.
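A warning light can be as simple as a dial with a line drawn on it. The sketch below tracks the task completion rate over a sliding window and reports when it falls below a threshold, like the 2 AM drop described earlier. The class name CompletionRateAlarm, the 15-minute window, the 50% threshold, and the minimum sample count are all assumptions chosen for illustration.

```python
from collections import deque


class CompletionRateAlarm:
    """Hypothetical warning light: fires when too few recent tasks succeed."""

    def __init__(self, window_seconds=900, threshold=0.5, min_samples=5):
        self.window_seconds = window_seconds  # 900 s = the last 15 minutes
        self.threshold = threshold            # the line the dial must not cross
        self.min_samples = min_samples        # avoid alarms on too little data
        self._outcomes = deque()              # (timestamp, succeeded) pairs

    def record(self, timestamp, succeeded):
        """Add one task outcome and drop outcomes older than the window."""
        self._outcomes.append((timestamp, succeeded))
        cutoff = timestamp - self.window_seconds
        while self._outcomes and self._outcomes[0][0] < cutoff:
            self._outcomes.popleft()

    def rate(self):
        """Fraction of tasks in the window that succeeded, or None if empty."""
        if not self._outcomes:
            return None
        return sum(ok for _, ok in self._outcomes) / len(self._outcomes)

    def firing(self):
        """True when there is enough data and the rate is below the line."""
        if len(self._outcomes) < self.min_samples:
            return False
        return self.rate() < self.threshold


alarm = CompletionRateAlarm()
for t in range(20):
    alarm.record(t, True)            # a healthy stretch: every task completes
healthy = alarm.firing()

alarm.record(1000, True)
for t in range(1001, 1010):
    alarm.record(t, False)           # later, most tasks start failing
unhealthy = alarm.firing()
```

The dial (rate) and the warning light (firing) read the same data. The alarm tells you that something crossed the line; the event timeline is still where you go to learn why.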

Congratulations

You have built a complete system from scratch.

Twenty chapters. From a genius locked in a room, sliding notes under a door — to a full operation with team leaders, team members, traffic lanes, private workspaces, review gates, relay handoffs, specialist skills, scheduled routines, a shared language for outside helpers, clear answer formats, double-checking loops, trust stamps and batch processing, and now, a control room that makes the invisible visible.

Here is every piece you assembled:

  1. The Loop — the basic cycle of asking, checking, and doing that powers everything.
  2. The Toolbox — a well-organized collection of tools anyone can find and use.
  3. The Blueprint — breaking a big request into a step-by-step plan.
  4. The Notebook — giving the helper a persistent, searchable memory.
  5. Safe Editing — careful, structured changes instead of blind overwrites.
  6. Safety Gates — guardrails that prevent dangerous actions.
  7. Private Workspaces — everyone works in their own space, so no one steps on anyone else's work.
  8. Traffic Lanes — controlling how many people work at the same time.
  9. The Team — a leader who delegates tasks and team members who carry them out.
  10. The Schedule — running tasks in the right order, respecting who depends on whom.
  11. The Review Gate — a checkpoint where a human looks over the work before it moves forward.
  12. The Relay Handoff — passing the baton smoothly when one helper's shift is over.
  13. Specialist Skills — calling in an expert for a specific job, then letting them leave.
  14. Scheduled Routines — things that happen automatically when certain events occur.
  15. The Shared Language — a common way to talk to outside helpers and services.
  16. Clear Answers — getting back replies in a neat, predictable shape.
  17. Double-Checking — catching mistakes and fixing them before they go further.
  18. Trust Stamps and Batching — knowing where every result came from, and handling many at once.
  19. Final Assembly — wiring every piece together into one working system.
  20. The Control Room — making the invisible visible so you can watch, fix, and improve.

These are the same patterns that serious systems in the real world rely on. The details differ. The scale differs. But the shape (the loop, the tools, the plan, the memory, the safety layers, the teamwork, the control room) is the same.

You are no longer just a user of these systems. You understand how they work. You can build them, take them apart, and make them better.

Go build something.