What Is an AI Agent Evidence Platform (and How to Evaluate One)
By Notary Team
An AI agent evidence platform is the system of record for what your agents did, captured in a form that survives adversarial scrutiny. It is not an observability tool. It is not a GRC tool. It is not a SIEM. It sits next to all three and answers the single question observability, GRC, and SIEM were never designed to answer: can you prove, to a skeptical third party, exactly what the agent received as input, what it produced as output, when it happened, and that the record has not been touched since?
If you are reading this post, you probably already know the category exists because someone in your organisation asked a question your current stack could not answer. Your general counsel asked for a complete record of a pricing agent's decisions. Your auditor wrote a finding against the retention window on your observability logs. A regulator sent a letter citing Article 12 of the EU AI Act. Your CISO was asked to sign a declaration about the integrity of agent records and declined.
This post is the evaluation guide. What the category actually is, the seven capabilities that separate real evidence platforms from repackaged log aggregators, the failure modes to watch for, and the questions to ask a vendor on the first call.
Why observability and GRC tools do not cover this
The most common mistake in the buying process is assuming your existing stack already does the job. It almost never does, and the gap is structural, not a missing feature.
Observability tools (Datadog, New Relic, Honeycomb, LangSmith) are optimised for latency, error rates, and debugging. Retention is set by cost. Log lines are mutable by design so that operators can redact, reshape, and drop noise. There is no cryptographic signing because signing would break the pipeline's throughput model. These choices are correct for the job the tool was built to do. They are disqualifying for evidence.
GRC platforms (Vanta, Drata, Secureframe) are optimised for policy documentation and control attestation. They track whether you have a control, not whether the control is producing a tamper-evident record of every agent action. They read evidence from your other systems. They do not produce it.
SIEMs (Splunk, Elastic, Chronicle) were built for security events, not for agent decisions. They can be bent toward audit-trail use cases, but the schema is wrong, the retention is wrong, and the export formats are not mapped to AI-specific frameworks.
An AI agent evidence platform is a distinct product. It ingests from your LLM providers, your orchestration layer, and your application code. It cryptographically signs each record at the point of ingestion. It stores the record in an append-only substrate. It produces evidence packs mapped one-to-one against the frameworks your auditors and regulators actually cite.
The seven capabilities that define the category
When you are evaluating vendors in this space, every claim should reduce to one of these seven capabilities, each of which ends in a concrete evaluation question. If a vendor cannot answer all seven questions cleanly, they are selling something adjacent, not an evidence platform.
1. Cross-provider ingestion with a normalised schema
Your agents almost certainly call more than one model provider. OpenAI's API emits tool calls in one shape. Anthropic's emits them in another. Vertex and Bedrock each have their own. If your evidence layer stores these in their native shapes, you cannot answer a regulator's question that spans providers without writing translation code under time pressure.
A real evidence platform ingests from every provider your agents touch and normalises to a single schema: input, output, tool calls, model identity, model version, system prompt, retrieved context, timing, and configuration. The normalised schema is what makes search, export, and cross-provider audit possible.
The evaluation question: show me the schema. If the vendor cannot produce a stable, versioned schema document, keep looking.
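To make the normalisation point concrete, here is a minimal sketch of what a cross-provider normaliser does. The payload shapes are heavily simplified and the `NormalisedRecord` fields are illustrative, not any vendor's actual schema; the point is that two native shapes collapse into one searchable record.

```python
import json
from dataclasses import dataclass

@dataclass
class NormalisedRecord:
    """One schema for an agent action, whichever provider emitted it.

    Field names here are illustrative, not any vendor's actual schema.
    """
    provider: str
    model: str
    tool_name: str
    tool_args: dict

def from_openai_style(msg: dict) -> NormalisedRecord:
    # OpenAI-style responses nest tool calls with JSON-encoded
    # argument strings (payload shape simplified for this sketch).
    call = msg["tool_calls"][0]["function"]
    return NormalisedRecord("openai", msg["model"],
                            call["name"], json.loads(call["arguments"]))

def from_anthropic_style(msg: dict) -> NormalisedRecord:
    # Anthropic-style responses carry tool use as a content block
    # with arguments already parsed (also simplified).
    block = next(b for b in msg["content"] if b["type"] == "tool_use")
    return NormalisedRecord("anthropic", msg["model"],
                            block["name"], block["input"])

# The same logical action, two native shapes, one normalised record each:
a = from_openai_style({"model": "gpt-x", "tool_calls": [
    {"function": {"name": "get_price", "arguments": '{"sku": "A1"}'}}]})
b = from_anthropic_style({"model": "claude-x", "content": [
    {"type": "tool_use", "name": "get_price", "input": {"sku": "A1"}}]})
assert a.tool_name == b.tool_name and a.tool_args == b.tool_args
```

Once everything lands in one shape, a question like "show every `get_price` call across all providers in March" becomes a single query instead of a translation project.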
2. Cryptographic signing at ingestion
This is the capability most commonly faked. A vendor will tell you the log store is "tamper-evident" because it runs on immutable infrastructure or because access is restricted. Neither is the same thing.
Tamper-evident, in a way that holds up in front of opposing counsel, means each record is signed at the moment of ingestion using a key controlled by the evidence platform, not by your operators. The signature covers the full record, including the timestamp and a reference to the previous record in the chain. Any subsequent modification breaks the signature mathematically.
The evaluation question: what signs the record, when is it signed, and can I independently verify a signature using a public key? If the answer involves any of your operators' credentials, the record is not evidence in any meaningful sense.
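The mechanics are worth seeing once. This sketch uses Ed25519 from the widely used `cryptography` package; the record fields and the canonicalisation step are illustrative assumptions, but the shape of the claim is exactly the one in the text: the signature covers the whole record, including the timestamp and the link to the previous record, and any change breaks verification mathematically.

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

def canonical_bytes(record: dict) -> bytes:
    # Canonical serialisation: stable key order so the same record
    # always produces the same bytes to sign and to verify.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()

platform_key = Ed25519PrivateKey.generate()  # held by the platform, not operators
public_key = platform_key.public_key()       # published for independent verification

record = {
    "input": "What is our refund policy?",
    "output": "Refunds within 30 days...",
    "timestamp": "2025-01-15T09:30:00Z",
    "prev_record_hash": "a1b2c3...",  # link to the previous record in the chain
}
signature = platform_key.sign(canonical_bytes(record))

# Anyone holding the public key can verify, without trusting the operator:
public_key.verify(signature, canonical_bytes(record))  # passes silently

# Change one field after signing and verification fails:
tampered = dict(record, output="Refunds within 90 days...")
try:
    public_key.verify(signature, canonical_bytes(tampered))
    tamper_detected = False
except InvalidSignature:
    tamper_detected = True
```

Note what the private key is doing: it belongs to the platform, so no operator credential can produce a valid signature over a rewritten record.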
3. RFC 3161 timestamping
A server clock can be backdated. An operator with the right permissions can change a record's timestamp field. Neither survives cross-examination.
RFC 3161 is an internet standard for trusted timestamps: you hash the record, send the hash to a timestamp authority, and receive back a signed token proving the record existed at that instant. The token is verifiable by anyone, forever, without trusting you or your vendor.
The evaluation question: does the platform use an RFC 3161 timestamp authority, which authority, and can I see a sample token? If timestamps come from a server in the platform's own infrastructure, they do not clear the evidentiary bar.
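The client side of that flow is simple enough to sketch. Step one, the hash, runs locally with the standard library; step two, the TimeStampReq/TimeStampToken exchange with the authority, is shown only in comments here because it requires a live TSA. The record bytes are illustrative.

```python
import hashlib

record_bytes = b'{"input":"...","output":"...","timestamp":"2025-01-15T09:30:00Z"}'

# Step 1 of RFC 3161: hash the record locally. Only the digest is sent
# to the timestamp authority; the record itself never leaves your
# infrastructure.
digest = hashlib.sha256(record_bytes).hexdigest()

# Step 2 (not executed in this sketch): wrap the digest in a TimeStampReq
# and POST it to the TSA. The response is a signed TimeStampToken binding
# this digest to the TSA's clock. Anyone can verify the token later with
# standard tooling, e.g.:
#   openssl ts -verify -digest <digest> -in token.tsr -CAfile tsa_ca.pem
```

This is also why backdating a server clock accomplishes nothing: the proof of time lives in the authority's signed token, not in any field your systems control.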
4. Append-only storage with hash-chained records
Signing is necessary but not sufficient. A signature proves a single record has not been modified. A hash chain proves the sequence has not been modified. If a bad actor deleted record 47 from a log store of 1,000 records, signature verification on each surviving record would still pass. Hash chaining (or a Merkle tree structure, which is a generalisation) makes that deletion detectable because record 48's hash would no longer match the expected predecessor.
A real evidence platform either hashes each record to its predecessor or anchors batches into a Merkle root that is published to an external, durable substrate (a public blockchain, a transparency log, an RFC 6962 certificate transparency-style ledger, or similar).
The evaluation question: describe the chain structure, and show me what it would take to detect a single record deletion mid-sequence.
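The deletion-detection claim is easy to demonstrate with a toy chain. This sketch hashes each record to its predecessor (real platforms may batch into Merkle trees instead, as noted above); deleting one record mid-sequence breaks the link at exactly that position, even though every surviving record is individually intact.

```python
import hashlib

GENESIS = "0" * 64  # fixed starting value for the first link

def chain(records):
    """Hash each record to its predecessor; returns (record, hash) pairs."""
    out, prev = [], GENESIS
    for r in records:
        h = hashlib.sha256((prev + r).encode()).hexdigest()
        out.append((r, h))
        prev = h
    return out

def verify(chained):
    prev = GENESIS
    for i, (r, h) in enumerate(chained):
        if hashlib.sha256((prev + r).encode()).hexdigest() != h:
            return f"break detected at position {i}"
        prev = h
    return "chain intact"

ledger = chain([f"record-{n}" for n in range(5)])
assert verify(ledger) == "chain intact"

del ledger[2]  # silently drop one record mid-sequence
assert verify(ledger) == "break detected at position 2"
```

Every record still carries a valid hash of its own content, which is the signature-verification scenario from the text; it is the broken predecessor link that exposes the deletion.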
5. Retention tuned to legal, not cost
Observability tools default to 30-, 60-, or 90-day retention because storage is expensive and most debugging happens within days of an incident. Legal retention for AI agents is a different universe. The EU AI Act requires automatically generated logs to be retained for "an appropriate period in light of the intended purpose of the high-risk AI system," which in practice lands at several years. HIPAA requires six years. SOX requires seven. Litigation holds can extend indefinitely.
An evidence platform needs retention policies that can be set per-agent, per-framework, with legal-hold override, and with cryptographic proof that retention was actually honoured. If retention is a slider in the vendor's billing dashboard, treat it with suspicion.
The evaluation question: can I set retention per agent, apply a legal hold, and get a signed attestation that the hold is in force?
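The policy logic itself is small; the hard parts are enforcement and attestation. This sketch shows the per-agent, per-framework shape with a legal-hold override (the class and field names are hypothetical, and a real platform would pair deletions with the signed attestations discussed above).

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class RetentionPolicy:
    agent_id: str
    framework: str            # e.g. "SOX", "HIPAA", "EU AI Act"
    retain_days: int          # HIPAA: six years; SOX: seven
    legal_hold: bool = False  # while set, nothing expires

    def may_delete(self, record_date: date, today: date) -> bool:
        if self.legal_hold:
            return False  # a hold trumps every expiry calculation
        return today > record_date + timedelta(days=self.retain_days)

policy = RetentionPolicy("pricing-agent", "SOX", retain_days=7 * 365)
old_record = date(2017, 1, 1)

assert policy.may_delete(old_record, date(2025, 6, 1)) is True   # past the window
policy.legal_hold = True
assert policy.may_delete(old_record, date(2025, 6, 1)) is False  # hold overrides
```

The evaluation point hiding in this sketch: `legal_hold` must be a first-class control with its own signed attestation, not a row an operator can quietly flip off.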
6. Framework-mapped export packs
The ultimate deliverable is not a dashboard. It is an evidence pack. A regulator, an auditor, or opposing counsel asks for something specific, and you hand them a bundle that answers the request in the form they expect.
"The form they expect" is non-trivial. EU AI Act Article 12 evidence is shaped differently from SOC 2 CC7.2 evidence, which is shaped differently from HIPAA Security Rule 164.312(b) evidence, which is shaped differently from a Rule 34 production request in federal litigation. A real platform has pre-built export packs mapped one-to-one against each framework, with the relevant records selected, the integrity proofs attached, and a chain-of-custody affidavit template included.
The evaluation question: show me the list of framework packs, and show me a sample export for one of them.
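For a rough sense of what "pack, not dump" means, here is a hypothetical manifest assembler. Everything about the shape is an assumption for illustration; the point is that records travel with their integrity proofs and a chain-of-custody reference, rather than as bare rows.

```python
import hashlib
import json

def build_pack(framework: str, records: list) -> dict:
    """Assemble an illustrative evidence-pack manifest (shape is hypothetical)."""
    entries = []
    for r in records:
        blob = json.dumps(r, sort_keys=True).encode()
        entries.append({
            "record": r,
            "sha256": hashlib.sha256(blob).hexdigest(),
            # a real pack would also attach: the ingestion signature,
            # the RFC 3161 token, and the record's chain position
        })
    return {
        "framework": framework,  # e.g. "EU AI Act Article 12"
        "record_count": len(entries),
        "records": entries,
        "chain_of_custody_affidavit": "affidavit-template.pdf",
    }

pack = build_pack("EU AI Act Article 12",
                  [{"input": "q", "output": "a", "timestamp": "2025-01-15T09:30:00Z"}])
assert pack["record_count"] == 1
```

Contrast this with a CSV export: the rows might be identical, but the proofs and the custody story are exactly the parts a CSV cannot carry.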
7. Deposition-ready chain of custody
The last capability is the one most often missing even from platforms that nail the first six. When the evidence pack lands on the opposing expert's desk, someone has to be able to declare, under penalty of perjury, how the records were captured, signed, stored, retrieved, and produced. That declaration is a legal document, not a support ticket. It names the systems, the roles, the access controls, and the verification procedures.
A mature evidence platform ships a chain-of-custody affidavit template co-signed by the vendor's compliance officer, updated as the platform changes, and accompanied by the public keys and verification tooling the other side will need.
The evaluation question: does the platform ship a chain-of-custody affidavit template, is it updated when architecture changes, and will the vendor stand behind it?
Four failure modes to watch for in the evaluation
Beyond the seven capabilities, there are four patterns that routinely disqualify otherwise-impressive vendors.
"Tamper-evident" without cryptography. The word gets used to describe access controls, immutable infrastructure, or audit logs on the log store. None of these survive the question "can you prove it?" Press until you see keys, signatures, and verification tooling.
Storage in a system the customer controls. If the evidence lives in your own S3 bucket, under your own IAM, accessible to your own operators, then your operators are a single point of compromise for the integrity claim. The whole point of an evidence platform is that the integrity claim does not rest on trusting you. Evidence should live in a substrate where even the vendor cannot retroactively modify it without detection, and certainly where the customer cannot.
Exports that are CSV dumps. A CSV of log lines is not an evidence pack. It is a spreadsheet. If the vendor's answer to "what do I hand the regulator?" is "we have an export button," they have not built the last mile of the product.
No story for the integrity of the ingestion path. The strongest signing at rest means nothing if an attacker could have modified the record between your agent emitting it and the platform signing it. Ask specifically: what is the integrity guarantee of the path from the agent's process to the moment of signing? The best platforms sign at the client library, in the same process as the agent, before the record ever leaves your infrastructure.
A first-call question list
Take this to the vendor on the first call:
Walk me through the lifecycle of a single agent action, from the model's response to a signed, stored record in your system. Name every hop and every trust boundary.
Which timestamp authority do you use, and can I verify a sample token against it using standard RFC 3161 tools?
Describe the chain structure. How would a single record deletion be detected?
Show me the normalised schema for a tool call, and show me the same tool call as emitted by three different providers, ingested and normalised.
Which framework export packs do you ship today? Show me a sample EU AI Act Article 12 pack end to end.
What does the chain-of-custody affidavit look like, who signs it, and how is it updated when you ship architectural changes?
Under what circumstances could your own operators, or a compromised operator credential, modify a stored record without detection?
If you get clean, confident answers to all seven, you are looking at an AI agent evidence platform. If you get hedging, marketing language, or a redirection to a roadmap, keep looking.
Where Notary fits
Notary is an AI agent evidence platform built to the seven capabilities described above. Cross-provider ingestion normalises OpenAI, Anthropic, Vertex, and Bedrock to a single schema. Signing happens at the client library in the agent's own process. Timestamps come from an RFC 3161 timestamp authority, with batched anchoring into a public transparency log. Retention is per-agent, with legal-hold override and signed attestations. Export packs ship for the EU AI Act, SOC 2, HIPAA, NIST AI RMF, and ISO 42001, each with a chain-of-custody affidavit template co-signed by our compliance officer.
If you are building the evaluation matrix right now, the Notary docs walk through each of the seven capabilities in detail, and the evidence pack gallery shows sample exports for the frameworks your auditors and regulators are most likely to cite. If you would rather talk to a human, book a technical walkthrough and we will run the first-call question list in front of you, live.