Notary

Our AI Agent Did Something Weird and We Can't Explain It: A Field Guide to the Incident Review

By Notary Team

Our AI agent did something weird and we can't explain it. That sentence, give or take a few words, is the most common Monday morning Slack message in AI platform teams right now. A customer screenshot lands in a support channel. A sales engineer forwards a screenshot of a quote the pricing agent should not have offered. A risk officer flags a loan decision that does not match the documented policy. Someone adds the platform lead to the thread. The platform lead opens the usual dashboards, scrolls, and sends the reply everyone dreads: we're looking into it.

This post is a field guide to what happens next. Not the version where everything is instrumented perfectly and the answer falls out of a single query. The realistic version, where the logs are scattered, the retention is short, the model versions have rolled forward twice since the incident, and the person asking the question is a regulator, a board member, or opposing counsel. How do you run the incident review when your current tooling cannot give you a confident answer, and how do you make sure the next weird thing is explainable on the spot?

Why "our AI agent did something weird and we can't explain it" keeps happening

The frequency of this message is not a sign that AI agents are uniquely unreliable. It is a sign that the evidence layer underneath agents has not caught up to how they are deployed. Three things stack together to produce the pattern.

First, agents are non-deterministic by design. The same input can produce materially different outputs across runs, across model versions, and across provider-side changes the customer never sees. A developer who is used to reproducing bugs by re-running the failing path hits a wall when re-running produces a different result. "Weird" becomes a category of behaviour, not a single bug.

Second, the context window that produced the weird behaviour is rarely captured in full. Most teams log the user prompt and the final model output. Few log the system prompt as it stood at that moment, the retrieved context, the tool call arguments, the intermediate tool responses, the temperature setting, the exact model version, or the full conversation history leading up to the weird turn. Without those, reconstruction is guesswork.

Third, agents touch multiple systems. A single "action" might involve an OpenAI chat completion, a retrieval against a vector store, three tool calls into internal APIs, a handoff to a different agent running on Anthropic, and a final write into a database. Each hop has its own log stream, its own retention policy, and its own schema. Joining them into a single coherent narrative is a forensic exercise, not a dashboard query.

The result is predictable: when the weird thing happens, nobody on the team has a complete, ordered, trustworthy record of what actually occurred. So the incident review becomes a reconstruction exercise under time pressure, and the reconstruction is, at best, plausible rather than provable.

The five questions the incident review has to answer

Independent of how you do the work, a complete incident review for an agent has to answer five questions. If you cannot answer all five, you do not have a review, you have a narrative. Narratives do not survive a regulator's follow-up, a plaintiff's expert, or a board member who wants to know how this will not happen again.

What exactly did the agent receive as input? Not a summary. The verbatim user prompt, the full conversation history, the system prompt as it stood at that moment, the retrieved context chunks in the order they were injected, any tool outputs fed back into the context, and every configuration value that shaped the call: model name, model version, temperature, top-p, stop sequences, tool definitions.

What exactly did the agent produce as output? Again, verbatim. The full model response, including any chain-of-thought if your deployment captures it, every tool call the model emitted, every tool response the orchestrator fed back, and the final action taken on any downstream system.

When did each of those happen, in what order, and can you prove the ordering? Real incidents hinge on ordering. Did the retrieval happen before or after a policy update? Did the tool call happen before or after the user clarified their request? Wall-clock timestamps on log lines are not enough because they can be backdated or drift across systems. You need a chronology you can defend.

Who or what could have modified the record between the event and your reading of it? If the answer is "anyone with production access," your record is not evidence, it is a plausible reconstruction. The strength of your answer to this question determines whether your incident review is useful for internal learning only, or whether it can also be used in a regulator response, a board memo, or a litigation discovery pack.

How does this chain of events compare to the policy the agent was supposed to follow? The policy lives somewhere: in a system prompt, in a guardrails document, in a set of tool descriptions, in a test suite. The incident review is not complete until you have explicitly compared what happened to what was supposed to happen and identified where the divergence started.

For most teams, questions one, two, and three are the hard ones. Question four is the one most often ignored and most often fatal under scrutiny. Question five is the one that turns a post-mortem into a policy change.

What the review actually looks like when the evidence is missing

If your current stack cannot answer those five questions cleanly, the review proceeds on a triage basis. Here is the sequence most teams end up running.

Hour one: freeze what you have

As soon as the weird thing is flagged, place a hold on the available logs before retention expires them. Export the last thirty days from your LLM provider dashboards, dump the relevant Datadog or Honeycomb traces to a separate bucket, snapshot your application audit table, and preserve the raw request and response bodies from any orchestration layer. Do this before you start analysing. Many teams skip this step and discover, three days into the review, that their default retention has already expired the traces they needed.

Hour two to four: pin the agent configuration at the moment of the event

The most common mistake in agent incident reviews is analysing the current configuration rather than the historical one. The prompt template has changed. The system prompt has been updated. A new tool was added. The model version ticked forward on the provider side. Before you look at the traces, reconstruct what the configuration was at the moment of the event. This usually means pulling deployment metadata, prompt template version control history, tool registry snapshots, and, if you are lucky, the configuration hash that was emitted alongside each request. If you are not lucky, it means a series of increasingly specific questions to the engineers who were on call that day.

Day one to two: reassemble the trace

With the logs preserved and the configuration pinned, reconstruct the full trace. This is where the joining work bites. OpenAI's request ID does not show up in your Datadog trace. Your Datadog trace does not carry the Anthropic completion ID for the handoff. Your internal tool call logs do not have either of them. You end up matching on timestamps, user IDs, and conversation IDs, hoping the clocks were close enough and no system dropped a record. If you built your stack without a single correlation ID that flows through every hop, this step takes the longest and produces the least confident output.
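The joining work described above can be sketched in a few lines. This is a best-effort merge under stated assumptions: the record shapes and field names below are illustrative, not any vendor's actual export format, and the result is a plausible chronology rather than a provable one, because it trusts each system's wall clock.

```python
from datetime import datetime

# Hypothetical records from two systems, each with its own schema.
# Field names here are illustrative, not real vendor export formats.
provider_logs = [
    {"ts": "2024-05-06T10:00:01Z", "conversation_id": "c-42", "event": "chat.completion"},
]
tool_logs = [
    {"timestamp": "2024-05-06T10:00:03Z", "conv": "c-42", "event": "tool.call"},
]

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def merge_by_time(provider_logs, tool_logs):
    """Normalise both schemas, then sort into a single chronology.

    Without a shared correlation ID, this ordering rests entirely on
    wall-clock timestamps, so clock drift across systems can silently
    reorder events.
    """
    events = [
        {"ts": parse(r["ts"]), "conversation": r["conversation_id"],
         "event": r["event"], "source": "provider"}
        for r in provider_logs
    ] + [
        {"ts": parse(r["timestamp"]), "conversation": r["conv"],
         "event": r["event"], "source": "tools"}
        for r in tool_logs
    ]
    return sorted(events, key=lambda e: e["ts"])
```

Every caveat in those comments is a reason the reassembled trace comes out as "plausible" rather than "provable".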

Day two to four: compare to policy and form a hypothesis

Once the trace is assembled, compare it line by line to the intended policy. Where did the agent diverge? Was the divergence in the retrieval (wrong context surfaced), in the reasoning (the model interpreted the context in an unexpected way), in the tool call (the tool was called with the wrong arguments), or in the execution (the downstream system did something the agent did not expect)? Each divergence point has a different fix, and the incident review is not complete until you have named which one applies.

Day four to five: write it up, honestly

The write-up has to state, explicitly, what you know, what you do not know, and what you cannot know because the evidence was not captured. The temptation to smooth over the gaps is enormous. It is also how teams end up in front of a regulator holding a document that does not match reality. Every "we believe" and "likely" in the write-up is a place where your evidence layer was insufficient. Those are the prioritised fix list for the next cycle.

Turning "our AI agent did something weird and we can't explain it" into "we have a full record"

The incident review above is what teams run today because it matches the evidence they currently collect. The goal is to make the next one structurally different. When the next weird thing happens, the message should read: "Our AI agent did something weird. Here is the full record. Here is what it did. Here is why. Here is the fix."

That requires four changes in how the evidence is captured, independent of which vendor you use.

Single correlation ID across every hop. Generate an ID at the entry point of the agent flow and propagate it through every model call, every retrieval, every tool call, every downstream write. The ID should appear in every log line produced by every system touched. This is the single highest-leverage change, and it costs almost nothing to implement, but it requires a coordinated sweep across every service.
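A minimal sketch of the propagation pattern, assuming a Python service: generate the ID once at the entry point, hold it in a `contextvars.ContextVar` so every function on the call path can read it without threading it through arguments, and stamp it into every structured log line. The function names here are illustrative, not part of any particular framework.

```python
import contextvars
import json
import logging
import uuid

# Correlation ID for the current agent flow, visible to every function
# on the call path (including async tasks) without explicit plumbing.
correlation_id = contextvars.ContextVar("correlation_id", default=None)

def start_agent_flow() -> str:
    """Call once at the entry point; every subsequent hop reuses this ID."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def log_event(event: str, **fields) -> str:
    """Emit a structured log line that always carries the correlation ID."""
    record = {"correlation_id": correlation_id.get(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    logging.getLogger("agent").info(line)
    return line
```

The same ID then travels outward as a header or metadata field on every model call, retrieval, and tool call, so that each downstream system's logs carry it too.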

Full-fidelity capture at the agent boundary. At the point where your code calls the model provider, capture the complete request and complete response, including the system prompt, the full message history, the tool definitions, the model version, all sampling parameters, every tool call, every tool response, and any provider-returned metadata. Store this separately from your application logs, in a system whose retention is set by legal policy rather than observability cost. If your application log retention is ninety days and your evidence retention needs to be seven years, you need two storage tiers.
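One way to place that capture point is a thin wrapper around the provider call. Everything here is a sketch: `CaptureStore` stands in for whatever long-retention tier you use, and the `client.create(**request)` call is a hypothetical provider client, not a specific vendor SDK.

```python
import time

class CaptureStore:
    """Stand-in for the separate, legally-governed retention tier."""
    def __init__(self):
        self.records = []

    def append(self, record: dict):
        self.records.append(record)

def captured_completion(client, store: CaptureStore, **request):
    """Call the provider and persist the verbatim request and response.

    The request dict carries everything that shaped the call: system
    prompt, message history, tool definitions, model name, sampling
    parameters. The response is stored whole, including tool calls and
    any provider-returned metadata.
    """
    response = client.create(**request)  # hypothetical provider client
    store.append({
        "captured_at": time.time(),
        "request": request,
        "response": response,
    })
    return response
```

Because the wrapper sits at the boundary, nothing the application does downstream can change what was captured, and the capture happens whether or not anyone anticipated this particular call would matter later.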

Cryptographic integrity on the historical record. Sign each captured record at the point of capture, using a key not accessible to production operators. Chain the records so that deletion or modification is mathematically detectable. Timestamp each record against an external authority so that the ordering cannot be challenged. This is the capability that moves your review output from "plausible reconstruction" to "defensible record."
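The chaining mechanism can be sketched with standard-library primitives. This is a minimal illustration, not a production design: it uses an HMAC with an in-memory key where a real deployment would sign with a key held in an HSM or KMS outside production operators' reach, and it omits the external timestamp authority (e.g. RFC 3161) that anchors ordering against an outside party.

```python
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"held-outside-production"  # illustrative; use an HSM/KMS key

def append_record(chain: list, payload: dict) -> dict:
    """Append a signed record whose body commits to the previous hash.

    Deleting or modifying any earlier record changes its hash, which
    breaks every later record's `prev` link: tampering is detectable.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(
        {"payload": payload, "prev": prev_hash, "ts": time.time()},
        sort_keys=True,
    )
    record = {
        "body": body,
        "hash": hashlib.sha256(body.encode()).hexdigest(),
        "sig": hmac.new(SIGNING_KEY, body.encode(), hashlib.sha256).hexdigest(),
    }
    chain.append(record)
    return record

def verify_chain(chain: list) -> bool:
    """Walk the chain, re-deriving each hash, link, and signature."""
    prev = "0" * 64
    for rec in chain:
        body = json.loads(rec["body"])
        if body["prev"] != prev:
            return False
        if hashlib.sha256(rec["body"].encode()).hexdigest() != rec["hash"]:
            return False
        expected_sig = hmac.new(SIGNING_KEY, rec["body"].encode(),
                                hashlib.sha256).hexdigest()
        if not hmac.compare_digest(rec["sig"], expected_sig):
            return False
        prev = rec["hash"]
    return True
```

The property that matters for the incident review is that verification is mechanical: anyone holding the chain and the verification key can confirm nothing was altered, without trusting the team that produced it.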

A configuration snapshot per action. Alongside each captured request, store the exact version of every prompt template, guardrails document, tool definition, and deployment configuration that was in force when the action ran. The cheap way to do this is to hash the full configuration bundle and store the hash with the request, while separately archiving each unique configuration bundle by hash. Reconstructing the configuration at the moment of the event then becomes a lookup, not an archaeology project.
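The hash-and-archive pattern above is a few lines of code. The sketch below assumes the configuration bundle can be serialised as JSON; canonical serialisation (sorted keys, fixed separators) ensures the same bundle always produces the same hash regardless of key order.

```python
import hashlib
import json

def config_hash(bundle: dict) -> str:
    """Deterministic hash of the full configuration bundle."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

config_archive = {}  # hash -> bundle; each unique bundle archived once

def snapshot_config(bundle: dict) -> str:
    """Archive the bundle by hash; attach the returned hash to each request."""
    h = config_hash(bundle)
    config_archive.setdefault(h, bundle)  # store only previously unseen bundles
    return h
```

At review time, the hash stored with the captured request points straight at the archived bundle: the configuration at the moment of the event is a dictionary lookup.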

Teams that make these four changes do not stop having weird incidents. They stop having unexplainable weird incidents. The review timeline compresses from a week to an afternoon. The write-up reads as a factual account rather than a hedged hypothesis. The question from the regulator or the board lands as an email, not a crisis.

A note on who gets asked the question

The last thing worth saying is about the human dynamic. "Our AI agent did something weird and we can't explain it" is almost never a technical question when it arrives at the platform lead's desk. It is a question from someone who needs an answer in a form they can relay further up the chain. The sales engineer needs to reply to the customer. The risk officer needs to file a disposition. The general counsel needs to brief the CEO. The board member needs to satisfy their own fiduciary duty.

The value of a proper evidence layer is that it lets the platform lead turn a crisis into a forwardable artifact. "Here is the full record of what happened, signed and timestamped. Here is our written analysis. Here is the fix we have deployed. Here is the evidence pack if anyone external asks." The technical work and the organisational work are the same work, because the organisational work runs on the output of the technical work.

If you are the person who has been getting these messages, and the review process described above feels uncomfortably familiar, the shortest path out is to stop trying to make the next incident faster to reconstruct and start making the next incident reconstructable by default. That is what an evidence layer does. The Notary docs walk through the capture architecture in detail, and the agent audit trail overview shows what a single agent action looks like once the five questions above are answerable on demand.

Monday morning will still bring weird things. It does not have to bring "we can't explain it."