How Do We Prove What Our AI Agent Did?
By Notary Team
How do we prove what our AI agent did? The question sounds simple until someone important asks it. Your general counsel wants to know what the pricing agent said to an enterprise customer on Tuesday afternoon. Your auditor wants evidence that the hiring screener did not reject candidates based on a protected class. A regulator sends a letter giving you seventy-two hours to produce a full record of every decision the claims bot has made since January.
"We have logs" is the answer you want to give. It is almost never the answer that holds up.
Why "we have logs" is not the same as "we have proof"
Every team running AI agents in production has logs somewhere. Most have them in too many places. OpenAI's dashboard shows tool calls in one shape. Anthropic's shows them in another. LangSmith captures spans. Datadog ingests application traces. CloudWatch has the Lambda invocations. And somewhere, maybe, a homegrown audit table that someone on the platform team spun up when SOC 2 came knocking.
The problem is not that these logs do not exist. The problem is that none of them, individually or collectively, answer the question your lawyer is actually asking.
When a general counsel asks how do we prove what our AI agent did, they are not asking for a stream of JSON. They are asking for something that will survive adversarial scrutiny: a sworn deposition, a regulator's technical review, a plaintiff's expert witness, a SOC 2 Type II audit. Four different standards. Four different failure modes. All of them land on the same four requirements.
Can you show the exact inputs and outputs of every agent action? Can you prove the record has not been modified since the event? Can you demonstrate a chain of custody back to the moment of execution? Can you produce the record on demand in a format the other side will accept?
Most teams can answer none of those cleanly. Some can do one. Almost nobody has the full stack.
The four scenarios where this actually gets asked
This is not a theoretical problem. It shows up, in the same shape, across four recurring situations.
The regulator letter
An EU data protection authority, the CFPB, the FTC, or a state insurance commissioner opens an inquiry. Your compliance officer forwards the letter at 4pm on a Friday. The letter cites Article 12 of the EU AI Act, or the record retention requirements of Section 1002.12 of Regulation B, or a specific line in the state's AI hiring statute. It asks for all records of model decisions relating to a named class of users between two specified dates. You have until end of business next Friday to produce.
The first question your outside counsel asks is whether the records you hand over are complete, tamper-evident, and admissible. If the answer is "we pulled them from Datadog this morning," you are about to spend a lot of money on a forensic engagement.
The SOC 2 finding
Your auditor runs a walkthrough of the agent-driven workflow. They ask to see evidence that tool calls are logged, that the log store is append-only, that retention meets your documented policy, and that access is restricted to a named group. You point them at the Datadog dashboard. They note that Datadog retention is ninety days, that the log store is not append-only, and that anyone with Datadog admin can delete a log line. They write up a finding.
You now have a gap to close before the Type II window closes. The clock is measured in weeks.
The board memo
A board member reads about the latest agent-gone-wrong story: a chatbot that promised refunds the company did not authorize, an underwriting model that learned the wrong correlation, a customer service agent that hallucinated a policy that does not exist. The next board meeting has an item added: "AI governance: evidence of control." The CTO has to show that for every agent in production, the company can produce a complete, trustworthy record of what the agent did and why it did it.
"Trustworthy" here does not mean "we trust it." It means "a skeptical third party would trust it."
The subpoena
A former customer sues, alleging the agent denied them a loan or a claim or a job. Their lawyer serves a discovery request. Rule 34 of the Federal Rules of Civil Procedure is not a suggestion: you must produce what is asked for, in a reasonably usable form, preserved under a litigation hold. The plaintiff's side will ask your CISO to sign a declaration about the integrity of the records. If the CISO cannot sign in good faith, they will not sign, and your case gets materially worse.
What most teams actually have
Pull the thread on any of those scenarios and you usually find the same pattern.
There is an LLM provider dashboard (OpenAI, Anthropic, Vertex, Bedrock) that shows the last thirty days of activity with limited export, no cryptographic integrity, and no consistent schema across providers. There is an observability tool (Datadog, New Relic, Honeycomb) tuned for latency and error-rate debugging, with retention set by cost rather than by legal policy, and with log lines that can be edited or dropped by anyone with the right IAM role. There is a homegrown audit table that captures a subset of fields, often missing the full prompt, the tool arguments, the model version, or the system prompt at the time of execution.
None of these are wrong for what they were built for. They were built for debugging and uptime. Not for evidence.
An agent audit trail, in the sense that lawyers and regulators use the phrase, is something different. It is a cryptographically signed, tamper-evident, chronologically ordered record of every input and output, stored in an append-only system, with retention tuned to the longest applicable legal hold, exportable on demand into packs mapped to specific frameworks. That is a fundamentally different product from an observability tool. Confusing the two is how teams end up in the SOC 2 finding in the first place.
How do we prove what our AI agent did: the five-part defensible answer
When someone asks the question, the answer that actually holds up in front of an auditor, a regulator, or opposing counsel has five parts.
First, the input. The exact user prompt, the system prompt, any retrieved context, any tool arguments, the model name, the model version, the temperature, the full configuration. Captured at the moment of execution, not reconstructed after the fact from a half-remembered deployment.
Second, the output. The full model response. Every tool call the agent made. Every downstream API invocation. Every state change. Captured verbatim, not summarised.
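As a concrete sketch, the first two parts can be captured together as a single immutable record at the moment of execution. The field names below are illustrative, not a standard schema; the point is that the record is frozen when the action happens and serialises to the same bytes every time, which is what makes hashing and signing it meaningful later.

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after capture
class AgentActionRecord:
    # Part one: the exact input, captured at execution time
    user_prompt: str
    system_prompt: str
    retrieved_context: list
    tool_arguments: dict
    model_name: str
    model_version: str
    temperature: float
    # Part two: the output, verbatim, not summarised
    response_text: str
    tool_calls: list
    captured_at: float = field(default_factory=time.time)

    def canonical_bytes(self) -> bytes:
        # Canonical JSON (sorted keys) so the same record always
        # serialises to the same bytes for hashing and signing
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")
```

The canonical serialisation is the important design choice: two captures of the same event must produce byte-identical output, or every integrity check downstream becomes meaningless.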
Third, the timestamp. Not a local server clock, which anyone can backdate. A signed timestamp from an RFC 3161 time-stamp authority (TSA), which gives you a cryptographic attestation that the record existed at a specific moment in time. An adversarial expert can challenge a log line's timestamp field. They cannot credibly challenge a valid RFC 3161 token.
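A detail worth noting: the record itself never leaves your infrastructure. You hash it and send only the digest to the TSA, which signs the digest together with its own clock. A minimal sketch of the digest step (the request itself goes over the RFC 3161 protocol, for example via `openssl ts -query` or a TSP client library):

```python
import hashlib

def digest_for_timestamping(record_bytes: bytes) -> bytes:
    # SHA-256 digest of the canonical record bytes. This digest, not the
    # record, is what goes into the RFC 3161 TimeStampReq, so the TSA
    # attests to the record's existence without ever seeing its content.
    return hashlib.sha256(record_bytes).digest()

digest = digest_for_timestamping(b'{"response_text": "...", "user_prompt": "..."}')
# Send `digest` to the TSA and store the returned TimeStampToken
# alongside the record; the token binds this digest to a signed time.
```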
Fourth, the chain. Each record cryptographically linked to the one before it using a hash chain or Merkle tree structure, so that any modification anywhere in the history breaks the chain mathematically. This is what "tamper-evident" actually means. It is not a label you put on a log store. It is a property the log store can prove on demand.
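The chain property is simple to sketch: each record's hash commits to both its own content and the previous record's hash, so editing or deleting any earlier record changes every hash after it. This is a minimal illustration, not a production design; a real system adds signing, Merkle proofs, and durable append-only storage.

```python
import hashlib

GENESIS = b"\x00" * 32  # sentinel "previous hash" for the first record

def link(prev_hash: bytes, record_bytes: bytes) -> bytes:
    # Each entry's hash commits to the previous hash, forming the chain
    return hashlib.sha256(prev_hash + record_bytes).digest()

def build_chain(records: list) -> list:
    hashes, prev = [], GENESIS
    for rec in records:
        prev = link(prev, rec)
        hashes.append(prev)
    return hashes

def verify_chain(records: list, hashes: list) -> bool:
    # Recompute from genesis; any modification anywhere breaks the match
    return build_chain(records) == hashes

records = [b"record-1", b"record-2", b"record-3"]
chain = build_chain(records)
assert verify_chain(records, chain)
tampered = [b"record-1", b"record-2-edited", b"record-3"]
assert not verify_chain(tampered, chain)  # tampering is detected
```

This is what "prove on demand" means in practice: verification is a recomputation anyone can run, not an assertion anyone has to believe.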
Fifth, the export. A bundle you can hand to a regulator or a court, mapped to the specific framework in play: SOC 2 CC7.2, EU AI Act Article 12, NIST AI RMF Measure 2.8, HIPAA Security Rule 164.312(b). Not a CSV of log lines. A signed evidence package with a chain of custody affidavit attached, deposition-ready.
If you have those five parts, Federal Rules of Evidence 901 (authentication) and 902 (self-authenticating records) start to look very different. You have something that holds up under cross-examination, not something you have to explain away.
Monday morning: where to start
You do not need to solve this all at once. But you do need to start somewhere that actually moves the problem.
Pick one agent in production, ideally the one your general counsel is most nervous about. For that agent, document three things: what inputs you currently capture, what outputs you currently capture, and where those records live today. Then check each against the five-part list above. Every gap is a piece of the answer you cannot currently give.
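That walk-through can be as mechanical as a checklist script. The field names below are placeholders, not a schema; substitute whatever your pipeline actually records for each of the five parts.

```python
# The five-part list, reduced to the fields each part needs.
# Field names are illustrative placeholders, not a standard.
REQUIRED = {
    "input": {"user_prompt", "system_prompt", "tool_arguments", "model_version"},
    "output": {"response_text", "tool_calls"},
    "timestamp": {"rfc3161_token"},
    "chain": {"prev_record_hash"},
    "export": {"evidence_pack_format"},
}

def gap_report(captured_fields: set) -> dict:
    # For each part of the answer, list what is missing today
    return {
        part: sorted(needed - captured_fields)
        for part, needed in REQUIRED.items()
        if needed - captured_fields
    }

# Example: a team that logs prompts and responses but nothing else
gaps = gap_report({"user_prompt", "response_text", "tool_calls"})
# Every key left in `gaps` is a part of the answer you cannot give yet
```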
Most teams discover that they are capturing enough data. The data is just in the wrong shape, in the wrong place, with the wrong integrity guarantees, with the wrong retention policy, spread across too many systems. That is a solvable problem. "We do not capture anything" is a harder problem, but it is also a rarer one.
The follow-up question, which we will pick up in later posts, is what the evidence layer actually looks like when you build it properly: cross-provider ingestion that normalises OpenAI, Anthropic, Vertex, and Bedrock into one schema; cryptographic signing at the point of ingestion; append-only storage with retention tuned to the longest applicable hold; and export packs mapped one-to-one against the frameworks your auditors and regulators care about. Notary exists because the teams running agents in production kept asking that same follow-up question and finding that the observability vendors and the GRC vendors were both pointing at each other.
For now, if someone asks you on Monday morning how do we prove what our AI agent did, have a real answer ready. "We have logs" is not one.
If you want to see what the evidence layer looks like in practice, the Notary docs walk through the architecture, and the EU AI Act evidence pack shows exactly what Article 12 compliance looks like when it is built in from day one rather than bolted on under regulator pressure.