The Run That Wrote the Same Invoice Twice
A finance team in regional lending discovered their agent had processed the same disbursement instruction four times in eleven minutes. No error was logged. The agent had simply retried on a network timeout, and the downstream system accepted every call because nobody had made the write operation idempotent. The result was not a crash. It was silent, confident wrongness at four times the intended scale.
That is the failure mode that actually hurts in production. Not the dramatic exception stack that pages your on-call engineer at 2 a.m. The quiet one that runs to completion, returns a success status, and leaves corrupted state behind it. Building an agent that fails safely is entirely a design question, and most teams answer it too late, after the first real incident rather than before it.
Idempotency Is Not Optional. It Is the Foundation
Every action your agent takes against an external system should be safe to repeat without changing the outcome beyond the first successful execution. This sounds obvious. It is almost never implemented fully on the first pass.
In a distribution context I worked in, agents were routing purchase orders to suppliers based on inventory signals. The agent would occasionally time out waiting for an acknowledgement from the supplier API, assume failure, and resubmit. The supplier system had no deduplication key on inbound orders. Within a week of go-live, the warehouse had received duplicate shipments on six SKUs. The fix was not in the agent logic. It was in requiring every write to carry a stable, deterministic request ID derived from the source record and the intended action, so the supplier system could recognise and discard a repeated call.
The pattern is the same whether you are writing to a ledger, updating a field service ticket, triggering a machine stop on a production line, or sending a customer notification. Generate the idempotency key before the action. Persist it. Pass it. If the downstream system does not support deduplication natively, build a thin wrapper that does. This is not glamorous engineering, but it is the difference between a retry being safe and a retry being an incident.
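A minimal sketch of that pattern, assuming the key is a SHA-256 hash over the source record ID, the action name, and the action's parameters. The names idempotency_key and DedupingReceiver are hypothetical, and the in-memory dictionary stands in for whatever durable store you would actually use.

```python
import hashlib
import json


def idempotency_key(source_record_id: str, action: str, params: dict) -> str:
    """Derive a stable key from the source record and the intended action."""
    # Canonical JSON (sorted keys, fixed separators) so the same logical
    # request always hashes to the same key, whatever the dict ordering.
    canonical = json.dumps(
        {"record": source_record_id, "action": action, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DedupingReceiver:
    """Hypothetical thin wrapper in front of a system with no native deduplication."""

    def __init__(self, apply_write):
        self._apply_write = apply_write  # callable that performs the real write
        self._seen = {}                  # key -> first result (assumption: in memory)

    def handle(self, key: str, payload: dict):
        if key in self._seen:
            # Repeated call: return the first result without writing again.
            return self._seen[key]
        result = self._apply_write(payload)
        self._seen[key] = result
        return result
```

A wrapper like this only protects you if the key is recorded in the same place, and ideally the same transaction, as the write itself. If the record of seen keys lives on the agent side, a call that timed out after the downstream system accepted it can still be repeated.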
Serialised Retries and Why Parallelism Is a Trap Under Failure
When an agent step fails, the instinct is to retry fast and in parallel to recover throughput. This is almost always wrong.
Parallel retries under partial failure create race conditions on shared state. In a manufacturing deployment where agents were updating work order status across a multi-site ERP, a transient database lock caused three concurrent agent threads to each believe they were the authoritative writer for the same work order. Each thread read stale state, computed a different next status, and wrote it. The work order ended up in a status that was not reachable by any valid workflow path. A human had to manually reconstruct the correct state from audit logs.
Serialised retries with exponential backoff and jitter are slower. They are also the only approach that is safe when your writes are not fully atomic or when your downstream systems have eventual consistency behaviour, which is most of the time in enterprise environments. Set a maximum retry count. Set a maximum elapsed time. When you exceed either, stop and route to a dead-letter queue or a human review step. Do not keep retrying indefinitely. An agent that cannot give up is more dangerous than one that fails fast.
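One way to sketch this, assuming a single worker retries one item at a time and that retryable failures surface as a TransientError exception. That exception, RetriesExhausted, run_with_retries, and the default limits are all illustrative choices, not recommendations for any particular system.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


class RetriesExhausted(Exception):
    """Raised after the item has been handed to the dead-letter handler."""


def run_with_retries(action, dead_letter, *, max_attempts=5, max_elapsed=60.0,
                     base_delay=0.5, max_delay=15.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call action(attempt) one attempt at a time until it succeeds or a limit is hit."""
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action(attempt)
        except TransientError as exc:
            last_error = exc
        if attempt == max_attempts:
            break  # retry count exceeded
        # Exponential backoff with full jitter: wait a random time between
        # zero and the capped exponential delay for this attempt.
        cap = min(max_delay, base_delay * 2 ** (attempt - 1))
        delay = random.uniform(0, cap)
        if (clock() - start) + delay > max_elapsed:
            break  # elapsed-time budget would be exceeded
        sleep(delay)
    # Give up: hand the failure to the dead-letter queue or human review step.
    dead_letter(last_error)
    raise RetriesExhausted(str(last_error))
```

Only TransientError is retried here. Any other exception propagates immediately, which keeps the decision about what is retryable out of the retry loop itself.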
Loop Filters Are How You Stop an Agent From Eating Itself
Agents that operate on event streams or polling queues can enter loops where the output of one step becomes the input that triggers the same step again. This is not a theoretical edge case. I have seen it happen in field operations, in finance reconciliation, and in distribution exception handling.
A field operations agent was designed to detect unresolved service tickets older than a threshold and escalate them. The escalation action updated a timestamp field on the ticket. The agent's polling query included that timestamp field in its filter logic. Every escalation pushed the ticket back into the active window. The agent escalated the same tickets in a tight loop for six hours before anyone noticed the notification volume.
The fix requires explicit loop detection at the agent level, not just at the system level. Before acting on a record, check whether this agent instance, or any recent instance, has already acted on this record in this run cycle. Maintain a short-lived action log keyed by record ID and action type. If the log shows a recent action, skip and log the skip rather than acting again. The window for this check should be longer than your longest expected run cycle, not shorter. In most operational contexts, a 24-hour deduplication window is a reasonable starting point.
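A minimal sketch of such an action log, assuming a 24-hour window and an in-memory dictionary. The ActionLog class and its method names are hypothetical. To cover "any recent instance" rather than just the current process, the dictionary would need to be replaced by a store shared between agent instances.

```python
import logging
import time

logger = logging.getLogger("agent.loop_filter")


class ActionLog:
    """Short-lived log of recent actions, keyed by record ID and action type."""

    def __init__(self, window_seconds=24 * 60 * 60, clock=time.time):
        self._window = window_seconds
        self._clock = clock
        self._last_action = {}  # (record_id, action_type) -> time of last action

    def recently_acted(self, record_id: str, action_type: str) -> bool:
        last = self._last_action.get((record_id, action_type))
        if last is not None and self._clock() - last < self._window:
            # Log the skip so the suppressed action is still visible.
            logger.info("skip %s on %s: already acted within window",
                        action_type, record_id)
            return True
        return False

    def record(self, record_id: str, action_type: str) -> None:
        # Call this only after the action has actually succeeded.
        self._last_action[(record_id, action_type)] = self._clock()
```

In the escalation example, the agent would call recently_acted(ticket_id, "escalate") before escalating and record(ticket_id, "escalate") afterwards, so a ticket that re-enters the polling window is skipped rather than escalated again.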
Recovery Logic Is What Separates a Shrug From an Incident
Not every failed run needs human intervention. The decision about which failures are self-recoverable and which require escalation is one of the most important design choices you will make, and it needs to be explicit, not implicit.
In a lending operations deployment, agents were processing document verification steps as part of loan origination. The design team initially set up a simple binary of success or failure, with all failures routing to a human queue. Within a week, the human queue was flooded with transient API timeouts that the agent could have retried safely. Reviewers were spending most of their time clearing noise rather than handling genuine exceptions.
The right model is a tiered classification of failure types. Transient infrastructure failures (network timeouts, rate limit responses, temporary service unavailability) are candidates for automatic retry with no human involvement. Validation failures, where the input data is malformed or missing required fields, should be routed to the data owner, not to a technical reviewer. State conflicts, where the agent finds the record in an unexpected condition, should be escalated to a domain expert who understands the business process. Security or permission failures should immediately halt the run and alert the team regardless of the hour.
Write this classification down before you build. Make it a first-class artifact of your agent design, as explicit as your data schema. The model powering your agent will not always give you a clean signal about which category a failure falls into, so the classification logic needs to live in deterministic code around the model, not inside the model's reasoning.
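As a sketch of what that deterministic classification might look like, assuming failures reach your code as Python exceptions. The Route names and the custom exception classes are hypothetical. Sending unrecognised failures to a domain expert is my assumption, not a rule: the point is only that an unknown failure should never default to automatic retry.

```python
from enum import Enum


class Route(Enum):
    AUTO_RETRY = "auto_retry"          # transient infrastructure failure
    DATA_OWNER = "data_owner"          # malformed or incomplete input
    DOMAIN_EXPERT = "domain_expert"    # record in an unexpected state
    HALT_AND_ALERT = "halt_and_alert"  # security or permission failure


class RateLimited(Exception):
    """Hypothetical: downstream returned a rate limit response."""


class ServiceUnavailable(Exception):
    """Hypothetical: downstream is temporarily unavailable."""


class ValidationFailure(Exception):
    """Hypothetical: input data is malformed or missing required fields."""


class StateConflict(Exception):
    """Hypothetical: record found in an unexpected condition."""


TRANSIENT = (TimeoutError, ConnectionError, RateLimited, ServiceUnavailable)


def classify(exc: BaseException) -> Route:
    # Check the most severe category first so it can never be masked.
    if isinstance(exc, PermissionError):
        return Route.HALT_AND_ALERT
    if isinstance(exc, TRANSIENT):
        return Route.AUTO_RETRY
    if isinstance(exc, ValidationFailure):
        return Route.DATA_OWNER
    if isinstance(exc, StateConflict):
        return Route.DOMAIN_EXPERT
    # Assumption: anything unrecognised goes to a human, never to a retry.
    return Route.DOMAIN_EXPERT
```

Keeping the mapping in one function like this makes the written classification and the running code easy to compare line by line.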
Observability Is Not Logging. It Is Knowing What the Agent Actually Did
Most teams instrument their agents for performance: latency, token counts, step durations. Very few instrument them for correctness at the action level. These are different things, and the second is the one that matters for safe failure.
For every write action an agent takes, you need a durable record that captures the record identifier, the action taken, the before state if you can capture it, the after state, the agent run ID, the timestamp, and whether the action was a first attempt or a retry. This is your audit trail and it is also your recovery tool. When something goes wrong, and it will, this log is what lets you reconstruct what happened, identify the blast radius, and reverse or correct the affected records without guessing.
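A sketch of one such record, assuming it is stored as one JSON line per action in an append-only file. The field names, ActionLogEntry, to_json_line, and append_entry are all illustrative. The attempt number doubles as the first-attempt-or-retry flag.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionLogEntry:
    record_id: str                 # identifier of the record that was written
    action: str                    # the action taken
    before_state: Optional[dict]   # None if the before state could not be captured
    after_state: dict
    run_id: str                    # the agent run that performed the action
    timestamp: str                 # e.g. ISO 8601 in UTC
    attempt: int                   # 1 = first attempt, greater than 1 = retry

    @property
    def is_retry(self) -> bool:
        return self.attempt > 1


def to_json_line(entry: ActionLogEntry) -> str:
    """Serialise an entry as a single line of JSON."""
    return json.dumps(asdict(entry), sort_keys=True)


def append_entry(path: str, entry: ActionLogEntry) -> None:
    """Append the entry and force it to disk so it survives a failed run."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(to_json_line(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())
```

A local file is the simplest durable target for a sketch. In practice the same entry shape could go to a database table or a log service, as long as the write completes independently of the agent run it describes.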
In a distribution context, an agent was updating shipment priority flags based on customer tier signals. A bug in the tier classification logic caused a batch of standard shipments to be flagged as priority. Because the team had full action-level logging, they could identify every affected shipment record within minutes, produce a correction script that reversed only those specific changes, and verify the reversal was complete. Without that log, the recovery would have required a full audit of the shipment table against the source system, which would have taken days.
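A recovery of that kind can be sketched as follows, assuming each log entry is a dictionary with record_id, run_id, and before_state fields, that entries are in chronological order, and that no later run has touched the affected records. The function names are hypothetical, and this is not the script that team wrote.

```python
def plan_reversal(entries, bad_run_id):
    """Work out which records to restore, and to what, for one faulty run.

    Returns (plan, manual): plan maps record_id to the state to restore,
    and manual lists records with no captured before state.
    """
    plan = {}
    manual = []
    for entry in entries:
        if entry["run_id"] != bad_run_id:
            continue  # leave every other run's changes alone
        record_id = entry["record_id"]
        if record_id in plan or record_id in manual:
            continue  # the earliest before state for this record wins
        if entry["before_state"] is None:
            manual.append(record_id)  # nothing to restore from: needs a human
        else:
            plan[record_id] = entry["before_state"]
    return plan, manual


def unreversed(plan, current_states):
    """Return the record IDs whose current state does not match the plan."""
    return [rid for rid, state in plan.items()
            if current_states.get(rid) != state]
```

Running unreversed after the correction script gives the verification step: an empty list means every planned record is back in its prior state.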
Build the action log first. Build the dashboards second. The dashboards are useful. The action log is essential.
The Practical Takeaway for Your Next Agent Build
Before you write a single line of agent logic, answer these five questions in writing and get agreement from your operations stakeholders.
- What is the idempotency key for every write this agent performs, and does the downstream system honour it?
- What is the maximum number of retries and elapsed time before this agent stops and routes to a dead-letter queue?
- What is the deduplication window that prevents this agent from acting on the same record twice in one cycle?
- What are the failure categories this agent can encounter, and which category gets automatic retry, which gets routed to a data owner or domain expert, and which halts the run immediately?
- What does a complete action log entry look like for this agent, and where is it stored in a way that survives a failed run?
If you cannot answer all five before you build, you are not ready to run this agent in production. The model at the centre of your agent is capable and it is also capable of being confidently wrong. The plumbing around it is what decides whether that wrongness is a recoverable blip or a three-day data recovery project. That plumbing is your job, not the model's.
