
We Can't Reconstruct What Our Agents Did Last Quarter

By Notary Team

"We can't reconstruct what our agents did last quarter" is the kind of sentence teams say quietly, usually after somebody important asks an uncomfortable question. Internal audit wants a sample of customer-facing agent actions from January. Outside counsel wants every decision path for a claims workflow that is now in dispute. A regulator asks for records covering a fixed date range, and the platform team discovers that what looked like good logging in real time has dissolved into fragments.

At first, this sounds like a retention problem. Usually it is not. Most teams retained something. The real failure is that they retained the wrong things, in the wrong places, with no stable chain of custody and no practical way to reassemble the whole story. Quarter-old agent records fail for the same reason quarter-old incident timelines fail: the systems were built for debugging live traffic, not for reconstructing history under pressure.

That distinction matters. If you cannot reconstruct what your agents did last quarter, you do not just have a tooling gap. You have a governance gap that will show up in SOC 2 walkthroughs, insurance underwriting, EU AI Act record-keeping reviews, HIPAA audit requests, and civil discovery. The problem is not whether some logs exist. The problem is whether you can produce a complete, trustworthy record that a skeptical third party would accept.

Why "we can't reconstruct what our agents did last quarter" is not just a retention problem

The common assumption is that longer retention would solve this. Buy another ninety days in Datadog, extend the CloudWatch retention policy, or export traces to cold storage, and you are done. That assumption is comforting and wrong.

Reconstruction fails long before retention expires. The prompt is in one system. Tool-call arguments are in another. The model version changed twice since the action happened. Retrieved context was never captured at all. Anthropic logs use one shape, OpenAI tool calls use another, and your in-house orchestration layer wrote only a redacted summary because full payload logging was considered too expensive. Three months later, every team can point to a piece of the story, and nobody can produce the story itself.

This is the contrarian point most teams miss: quarter-old AI evidence is usually lost through fragmentation, not deletion. You can keep logs for seven years and still be unable to answer a basic question about a single agent action on a single day. Retention matters, but reconstructability matters first.

That is why frameworks that sound abstract on paper become painfully concrete in practice. SOC 2 CC7.2 asks whether you monitor system components and take action on anomalies. Monitoring implies records you can actually examine. The HIPAA Security Rule at 45 CFR 164.312(b) requires audit controls that record and examine activity in systems containing ePHI. The EU AI Act's Article 12 record-keeping duty is not satisfied by saying the records once existed in a vendor console with a thirty-day window. Federal Rule of Civil Procedure 34 does not ask whether your observability pipeline felt complete in February. It asks what you can produce now.

What usually breaks when teams try to reconstruct a quarter-old agent action

When a company tries to answer a quarter-old question, the same failure modes show up over and over again.

Provider drift

The agent called OpenAI for some workflows, Anthropic for others, and maybe Vertex AI for one pilot that became production without anyone quite announcing it. Each provider emits records in a different format. If nobody normalized those records when they were created, reconstructing them later becomes a translation project at the worst possible time.

Configuration drift

The model version, system prompt, tool registry, retrieval configuration, and guardrail settings changed after the event. What survives in the logs reflects the current system, not the system that existed when the action happened. That makes later reconstruction look cleaner than reality, which is exactly what opposing counsel or an auditor will attack.

Context loss

Teams often capture the final model output but not the retrieved documents, intermediate tool results, or user-visible side effects. That means you can show what the agent said but not why it said it, or what it relied on. In a discovery dispute or fairness review, that gap is fatal.

Integrity gaps

The records are technically present, but there is no cryptographic proof they were not modified after the fact. An export from Splunk or Datadog may be useful operationally, but it does not answer the adversarial question: how do we know this was not altered by an operator with admin access last week?

Retention mismatches

Legal and compliance timelines run in years. Observability timelines run in days or months. If your product team tuned retention around storage cost, you are using an uptime budget to solve a record-keeping problem. That usually ends badly.

The five records you need if you ever want to answer the question cleanly

If you want to reconstruct what your agents did last quarter, there are five record classes that need to exist at execution time, not after the incident. This is the section to take into a meeting.

1. Execution record. The exact user input, system prompt, model identity, model version, configuration, and timing for the action. Not a summary. The exact record.

2. Context record. Every retrieved document, memory lookup, policy snippet, or structured input that influenced the action. If retrieval-augmented generation was in play, this is where most teams discover they have a hole.

3. Tool record. Every tool call, argument set, downstream API invocation, result payload, and side effect. If the agent opened a Zendesk ticket, updated Salesforce, sent an email, or wrote to a database, that needs to be part of the record.

4. Integrity record. A signature, timestamp, and chain reference proving the record existed when claimed and has not been modified since. RFC 3161 timestamps, hash chaining, and public-key verification are what turn an event record into evidence.

5. Export record. A reproducible bundle that can be handed to internal audit, outside counsel, or a regulator without someone manually stitching screenshots together the night before. If you cannot export it cleanly, you do not really have it.

Most teams have part of one record class, pieces of three others, and nothing at all for the fourth, the integrity record. That is why quarter-old reconstruction feels expensive and unreliable even before a third party gets involved.
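
To make that concrete, here is a minimal sketch of what a single normalized record covering all five classes might look like. The field names and structure are illustrative assumptions, not Notary's actual schema; the point is that one versioned structure holds the whole story.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class EvidenceRecord:
    """One agent action, captured at execution time.

    Illustrative sketch: field names are assumptions, not a real schema.
    What matters is that all five record classes live in one versioned
    structure instead of five different systems.
    """
    schema_version: str                       # version the record model itself
    # 1. Execution record
    user_input: str
    system_prompt: str
    model: str                                # provider, model name, and version
    model_config: dict[str, Any]              # temperature, tool registry, guardrails
    started_at: str                           # ISO 8601, from a trusted clock
    # 2. Context record
    retrieved_context: list[dict[str, Any]]   # documents, memory lookups, policy snippets
    # 3. Tool record
    tool_calls: list[dict[str, Any]]          # name, arguments, result, side effects
    # 4. Integrity record, filled at capture time (see the signing sketch below)
    content_hash: str | None = None
    prev_hash: str | None = None              # chain reference to the prior record
    signature: str | None = None
    rfc3161_token: bytes | None = None
    # 5. Export record is a property of the store, not a field: any slice of
    # records must be exportable as a verifiable bundle.
```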

Why observability tools break down here

This is the point where someone says, reasonably, that Datadog, Splunk, CloudWatch, Honeycomb, or LangSmith already has most of the raw data. Sometimes that is true. It still does not solve the problem.

Observability systems optimize for troubleshooting production behavior. They are built around aggregation, search, sampling, redaction, retention by cost, and flexible operator access. Those are good properties for operating software. They are weak properties for proving history.

A quarter-old reconstruction request stresses exactly the dimensions observability tools de-prioritize. You need full-fidelity payload capture, not sampled traces. You need schema stability across providers, not per-team logging conventions. You need retention tied to legal hold and policy, not spend. You need tamper evidence, not simply IAM. You need a chain of custody, not a dashboard link.

That is the deeper category mistake. Observability answers "what is happening right now, and why is latency up?" Evidence answers "what happened on January 14 at 3:14 PM, what did the agent rely on, what changed in the world because of it, and can you prove the record is authentic?" Those are not the same product requirements.

How to fix "we can't reconstruct what our agents did last quarter" before the next request arrives

The fix is not to ask engineers to log more random fields. The fix is to design for reconstructability as a first-class output of the system.

Start with ingestion. Every provider and orchestration layer your agents touch needs to feed a normalized evidence schema at the point of execution. OpenAI, Anthropic, internal tools, queue workers, retrieval systems, and downstream application actions should land in one record model, versioned over time. If you normalize later, under request pressure, you will normalize badly.
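
As a sketch of what that looks like in code, the adapter below maps a raw provider event into the EvidenceRecord structure from earlier. The payload shape and the from_openai helper are hypothetical simplifications, not OpenAI's actual response format.

```python
# Hypothetical adapter layer: each provider's raw event is mapped into the
# shared EvidenceRecord at the point of execution, never retroactively.
# The payload fields below are simplified illustrations, not exact formats.

def from_openai(raw: dict, user_input: str, system_prompt: str) -> EvidenceRecord:
    return EvidenceRecord(
        schema_version="1.0",
        user_input=user_input,
        system_prompt=system_prompt,
        model=f"openai/{raw.get('model', 'unknown')}",
        model_config=raw.get("config", {}),   # supplied by your orchestration layer
        started_at=raw.get("created_at", ""),
        retrieved_context=[],                 # filled in by the retrieval layer
        tool_calls=[
            {"name": c.get("name"), "arguments": c.get("arguments")}
            for c in raw.get("tool_calls", [])
        ],
    )

# One adapter per source, all landing in the same versioned record model:
# anthropic, vertex, queue workers, retrieval systems, in-house tools.
ADAPTERS = {"openai": from_openai}
```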

Then solve integrity at capture time. Sign records as they are created, not once a day in a batch job. Use an RFC 3161 timestamp authority instead of trusting a mutable server clock. Chain records so that deletion or modification is mathematically detectable. The standard to aim for is simple: if a skeptical expert examines the record six months later, verification should not depend on trusting your admins or your vendor's internal assurances.
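
Here is a minimal sketch of capture-time sealing, assuming the EvidenceRecord above and the Python cryptography package. The RFC 3161 call is stubbed with a comment because it depends on your timestamp authority, and the in-process key stands in for what would really be a KMS- or HSM-held key.

```python
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

SIGNING_KEY = Ed25519PrivateKey.generate()  # in production: a KMS- or HSM-held key
GENESIS = "0" * 64                          # chain anchor for the first record

def canonical_body(record: EvidenceRecord) -> bytes:
    """Deterministic serialization of everything except the integrity fields."""
    skip = {"content_hash", "signature", "rfc3161_token"}
    return json.dumps(
        {k: v for k, v in vars(record).items() if k not in skip},
        sort_keys=True,
        default=str,
    ).encode()

def seal(record: EvidenceRecord, prev_hash: str) -> EvidenceRecord:
    """Hash-chain and sign a record as it is created, not in a nightly batch."""
    # Each record commits to its predecessor's hash, so deleting or editing
    # any record breaks every hash after it.
    record.prev_hash = prev_hash
    record.content_hash = hashlib.sha256(canonical_body(record)).hexdigest()
    record.signature = SIGNING_KEY.sign(record.content_hash.encode()).hex()
    # A real pipeline would also request an RFC 3161 token from a timestamp
    # authority here and store it in record.rfc3161_token.
    return record
```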

Next, separate retention policy from observability spend. Your agent evidence store should reflect the longest applicable requirement across your frameworks and legal exposure. For some teams that means years. For others it means a legal hold that can freeze records indefinitely. Either way, it should not be a side effect of whatever plan the platform team bought for traces.

Finally, make export a product, not a heroics exercise. If internal audit asks for a quarter's worth of pricing-agent decisions affecting a named customer segment, the output should be a defensible package. Not screenshots. Not copied JSON in a spreadsheet. A package with the records, signatures, timestamps, and an explanation of custody.
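
Continuing the sketch, the export itself can be unglamorous: a zip containing the raw records, a manifest, the verification key, and a custody note, so a reviewer can check it without access to your production systems. The layout below is an illustrative assumption, not a formal evidence-pack format.

```python
import json
import zipfile

def export_bundle(records: list[EvidenceRecord], path: str,
                  public_key_pem: str, custody_note: str) -> None:
    """Write a self-contained, independently verifiable evidence bundle."""
    with zipfile.ZipFile(path, "w") as bundle:
        for i, record in enumerate(records):
            bundle.writestr(f"records/{i:06d}.json",
                            json.dumps(vars(record), default=str, indent=2))
        manifest = {
            "record_count": len(records),
            "head_hash": records[-1].content_hash if records else None,
            "verification_key": public_key_pem,  # enables third-party verification
            "custody": custody_note,             # who exported, when, and why
        }
        bundle.writestr("manifest.json", json.dumps(manifest, indent=2))
```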

Monday morning: the fastest way to find out if you really have this problem

Pick one production agent and one date from ninety to one hundred twenty days ago. Ask your team to produce the complete record of a single user-visible action from that day. Give them two hours.

Do not let them hand-wave with a dashboard. Ask for the exact prompt, system prompt, retrieved context, tool calls, model version, outputs, side effects, timestamp, and proof the record was not altered after capture. Then ask one more question: if outside counsel requested twenty more examples by Friday, could you repeat the process without turning it into a fire drill?
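
That last ask, proof the record was not altered after capture, is mechanically checkable if the chain and signatures from the earlier sketch exist. A verifier might look like the following, reusing canonical_body and GENESIS from the signing sketch.

```python
import hashlib

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_chain(records: list[EvidenceRecord],
                 public_key: Ed25519PublicKey) -> bool:
    """Recompute every hash and check every signature, in order.

    Any post-capture edit, insertion, or deletion makes this return False,
    which is the property a plain log export cannot offer.
    """
    prev = GENESIS
    for record in records:
        if record.prev_hash != prev:
            return False  # chain broken: a record was removed or reordered
        if hashlib.sha256(canonical_body(record)).hexdigest() != record.content_hash:
            return False  # record body was edited after capture
        try:
            public_key.verify(bytes.fromhex(record.signature),
                              record.content_hash.encode())
        except InvalidSignature:
            return False  # re-signed by someone without the original key
        prev = record.content_hash
    return True
```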

That exercise tells you more than a month of governance meetings. If the answer depends on one staff engineer, three different consoles, and a Slack thread from January, then you already know what the real state of your controls is.

A useful next step is to map that single action against the five record classes above. Mark each class as present, partial, or missing. Partial usually means the logs exist but are not exportable or not defensible. Missing means you never captured the information at all. Either result gives you a concrete remediation plan.

Where Notary fits

Notary exists for exactly this reconstruction problem. It captures agent activity into a normalized schema across providers, signs records at ingestion, timestamps them cryptographically, stores them append-only, and exports evidence packs built for actual audit, discovery, and regulatory workflows. The point is not more dashboards. The point is that when someone asks what your agents did last quarter, you can answer from a system of record instead of starting a scavenger hunt.

If you want to pressure-test your current setup, start with the Notary docs and compare your stack against the record classes in this post. If you want to see what a defensible export looks like, the evidence packs page shows the format teams use when the request is real, time-bound, and adversarial.