Notary

Regulator Asked About Our AI Agents and We Can’t Answer: What to Do in the First 72 Hours

By Notary Team

“A regulator asked about our AI agents, and we can’t answer.” For most teams, that sentence arrives as a shock, then quickly becomes a governance stress test.

The GC wants a factual timeline. The CISO wants to preserve records before retention windows expire. Platform engineering wants to know which systems are authoritative. Meanwhile the regulator has a deadline, and deadlines do not care that your traces are split across OpenAI dashboards, Datadog, and two internal services nobody has touched in six months.

This is the uncomfortable truth: observability coverage is not evidence readiness. You can have excellent dashboards and still fail a basic regulatory request.

When a regulator asks about your AI agents and you can’t answer, the question is not whether your model is “good.” The question is whether you can produce complete, authenticated, and explainable records for a specific period and decision path. If you cannot, the risk is legal, operational, and reputational at once.

Regulator asked about our AI agents and we can’t answer: start with legal scope, not log collection

The first bad instinct is to collect everything. Teams call this “being safe.” In practice it creates contradictory exports, chain-of-custody confusion, and review debt.

Start with scope control. What exactly did the regulator ask for, and under which authority?

If the request references EU AI Act Article 12, your response must address automatically generated logs and record-keeping proportional to system purpose and risk. If you are preparing controls for SOC 2 CC7.2, you need evidence that monitoring and anomaly-response controls are operating consistently. If US litigation risk is adjacent, Federal Rule of Evidence 901 authentication questions and FRCP Rule 34 production expectations matter immediately.

Build a one-page request scope memo within the first two hours:

  1. Date and time range requested.
  2. Systems and agent workflows in scope.
  3. Data subjects or transaction classes in scope.
  4. Required event fields.
  5. Delivery format expected by counsel or regulator.
  6. Named owners for legal, security, and platform workstreams.

Without this memo, every team answers a different question and nobody answers the regulator’s actual question.
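The memo travels better as structured data than as prose, because every workstream then reads an identical scope. A minimal sketch, assuming a Python shop; all field values below are hypothetical, not from any real request:

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the memo is locked once counsel approves it
class RequestScopeMemo:
    time_range_utc: tuple[str, str]   # 1. date and time range requested
    systems_in_scope: list[str]       # 2. systems and agent workflows
    data_subjects: list[str]          # 3. subjects or transaction classes
    required_fields: list[str]        # 4. event fields the packet must carry
    delivery_format: str              # 5. format expected by counsel/regulator
    owners: dict[str, str]            # 6. named owners per workstream


# Hypothetical example values:
memo = RequestScopeMemo(
    time_range_utc=("2025-01-01T00:00:00Z", "2025-03-31T23:59:59Z"),
    systems_in_scope=["underwriting-agent", "llm-gateway"],
    data_subjects=["declined_applicants"],
    required_fields=["event_id", "agent_id", "timestamp_utc", "final_output"],
    delivery_format="signed JSON export plus PDF narrative",
    owners={"legal": "GC office", "security": "CISO office", "platform": "eng lead"},
)
```

Version-controlling this object alongside the response packet also gives you a record of when scope changed and who approved the change.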

Why your current logging stack usually fails under scrutiny

Most teams already have heavy logging spend. That does not mean they have defensible records.

Datadog, Splunk, CloudWatch, and OpenTelemetry are optimized for reliability operations. They solve latency, error rate, and service health problems. They are not designed as neutral systems of record for legal-grade reconstruction.

Three technical gaps show up repeatedly.

First, retention is misaligned. Operational logs are often kept for 30 to 90 days. Regulatory timelines, internal investigations, and litigation holds can require years.

Second, integrity guarantees are incomplete. Access controls can limit who edits records, but they do not prove records were never altered. “Only admins can change it” is not the same as “it is cryptographically detectable if changed.”

Third, schema fragmentation is severe in AI stacks. OpenAI tool calls, Anthropic events, gateway traces, and application logs rarely line up field-for-field. Under deadline, teams end up manually reconciling JSON variants and introducing errors in the narrative.
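Reconciling those JSON variants in code, once, is far less error-prone than reconciling them by hand under deadline. A hedged sketch that maps two hypothetical payload shapes onto one canonical event; the input field names are illustrative, not actual OpenAI or Anthropic wire formats:

```python
def normalize_event(source: str, raw: dict) -> dict:
    """Map a provider-specific payload onto one canonical shape.

    The source field names below are assumptions for illustration,
    not real provider schemas.
    """
    if source == "provider_a":  # e.g. a hypothetical gateway export
        return {
            "event_id": raw["id"],
            "timestamp_utc": raw["created_at"],
            "model_name": raw["model"],
            "tool_name": raw.get("tool", {}).get("name"),
            "final_output": raw["output_text"],
        }
    if source == "provider_b":  # e.g. a hypothetical application log
        return {
            "event_id": raw["trace_id"],
            "timestamp_utc": raw["ts"],
            "model_name": raw["llm"]["name"],
            "tool_name": raw.get("tool_call_name"),
            "final_output": raw["completion"],
        }
    raise ValueError(f"unknown source: {source}")
```

Because every source funnels through one function, a reviewer can audit the mapping itself instead of auditing hundreds of manual reconciliations.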

This is the contrarian point worth repeating: mature observability can coexist with poor evidentiary posture. They are different disciplines.

The first 72 hours: seven workstreams that prevent panic-driven mistakes

Treat this as incident response with legal impact. Run parallel tracks with named accountable owners.

1) Preservation and legal hold

Issue preservation notices before broad querying. Include provider logs, gateway traces, SIEM indexes, object stores, and any backup location that may contain agent records.

Target output by hour 6: hold notice distributed, recipients confirmed, acknowledgments tracked.

2) Source inventory with confidence scoring

List every source of truth candidate and assign confidence labels:

  • A: signed, immutable, complete.
  • B: complete but mutable.
  • C: partial and mutable.

Target output by hour 12: source matrix with owner, retention window, integrity notes, and retrieval path.

3) Canonical schema lock

Define a single schema for this response packet and freeze it. Example required fields:

  • event_id
  • agent_id
  • session_id
  • timestamp_utc
  • user_input
  • retrieved_context_ref
  • model_provider
  • model_name
  • model_version
  • tool_name
  • tool_args
  • tool_result
  • policy_decision
  • final_output
  • operator_override
  • integrity_hash
  • signature_metadata

Target output by hour 18: schema v1 approved by legal plus engineering.
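Freezing the schema is easiest to enforce when the field list exists as code rather than as a document. A minimal sketch using the fields above, plus the completeness check you will want for metric 2 later:

```python
# The frozen field list for this response packet (schema v1).
REQUIRED_FIELDS = frozenset({
    "event_id", "agent_id", "session_id", "timestamp_utc", "user_input",
    "retrieved_context_ref", "model_provider", "model_name", "model_version",
    "tool_name", "tool_args", "tool_result", "policy_decision",
    "final_output", "operator_override", "integrity_hash", "signature_metadata",
})


def missing_fields(event: dict) -> set[str]:
    """Return the required schema fields this event fails to carry."""
    return set(REQUIRED_FIELDS) - set(event)
```

Running every candidate record through this check before export turns “is the packet complete?” from an opinion into a count.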

4) Time reconciliation and event ordering

Normalize timestamps to UTC and preserve original timezone + source clock in metadata. Build one master timeline that records confidence level for each sequence edge.

Target output by hour 24: ordered chronology with gaps flagged, not hidden.
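The normalization rule above is simple to implement; what matters is preserving the original timestamp and timezone rather than overwriting them. A sketch using Python's standard zoneinfo module (source names are hypothetical):

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def to_master_timeline(local_iso: str, source_tz: str, source_clock: str) -> dict:
    """Normalize one source timestamp to UTC while keeping the original
    value, timezone, and source clock as metadata."""
    local = datetime.fromisoformat(local_iso).replace(tzinfo=ZoneInfo(source_tz))
    return {
        "timestamp_utc": local.astimezone(ZoneInfo("UTC")).isoformat(),
        "original_timestamp": local_iso,
        "original_timezone": source_tz,
        "source_clock": source_clock,
    }
```

For example, a gateway record stamped 09:30 New York time in early March lands at 14:30 UTC on the master timeline, with the original value still attached for cross-examination.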

5) Chain-of-custody controls

For every extraction, log who initiated it, when, from which system, and with which method. Hash the resulting files and store a manifest. If you have RFC 3161 timestamp tokens, attach them now.

Target output by hour 36: custody register + hash manifest bundle.
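A custody register entry needs nothing exotic: the extraction metadata plus a SHA-256 of the file, so any later copy can be checked against the manifest. A minimal sketch; all example values are illustrative:

```python
import hashlib
import json
from datetime import datetime, timezone


def custody_entry(path: str, data: bytes, operator: str,
                  system: str, method: str) -> dict:
    """One custody-register row: who extracted, when, from which system,
    with which method, and the SHA-256 of the resulting file."""
    return {
        "file": path,
        "sha256": hashlib.sha256(data).hexdigest(),
        "extracted_by": operator,
        "extracted_at_utc": datetime.now(timezone.utc).isoformat(),
        "source_system": system,
        "method": method,
    }


# Illustrative usage: hash one export and serialize the manifest.
entry = custody_entry("export_0001.json", b'{"events": []}',
                      operator="jdoe", system="llm-gateway",
                      method="API export")
manifest = json.dumps([entry], indent=2)
```

If you later obtain RFC 3161 timestamp tokens, they attach naturally to the manifest file rather than to each export individually.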

6) Narrative packet draft

Start narrative drafting before data collection is “perfect.” Your first version should explicitly separate confirmed facts, unresolved gaps, and remediation underway.

Target output by hour 48: counsel-reviewed response draft tied to artifact IDs.

7) Executive decision brief

Create a one-page brief for GC, CISO, and executive sponsor: confidence level, unresolved risks, potential disclosure impact, and immediate control upgrades.

Target output by hour 72: signed decision memo with owners and deadlines.

Regulator asked about our AI agents and we can’t answer: four failure modes that increase exposure

These are the mistakes that repeatedly turn a manageable request into a bigger problem.

Failure mode 1: over-collection without traceability

Teams pull massive log volumes without preserving extraction metadata. Later they cannot prove which export produced which conclusion.

Failure mode 2: claiming tamper-evidence without cryptographic proof

Saying records are “immutable” because access is restricted invites challenge. If there is no per-record signature, hash chain, or external timestamp proof, integrity remains a policy claim, not a technical fact.
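The difference between a policy claim and a technical fact is small in code. A minimal hash-chain sketch: each record's hash covers the previous hash, so altering any record invalidates every subsequent link. This is one common construction, not the only one:

```python
import hashlib
import json

GENESIS = "0" * 64  # conventional starting value for the chain


def chain_records(records: list[dict]) -> list[dict]:
    """Link records so that altering any one changes all later hashes."""
    prev, out = GENESIS, []
    for rec in records:
        body = json.dumps(rec, sort_keys=True)
        h = hashlib.sha256((prev + body).encode()).hexdigest()
        out.append({"record": rec, "prev": prev, "hash": h})
        prev = h
    return out


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edit anywhere breaks verification."""
    prev = GENESIS
    for link in chain:
        body = json.dumps(link["record"], sort_keys=True)
        if link["prev"] != prev:
            return False
        if hashlib.sha256((prev + body).encode()).hexdigest() != link["hash"]:
            return False
        prev = link["hash"]
    return True
```

With this in place, “it is cryptographically detectable if changed” is a statement a reviewer can test, not a statement they have to take on faith.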

Failure mode 3: timeline drift between legal and engineering packets

If legal summarizes one timestamp while engineering later revises it, credibility erodes fast. Use one master chronology and version control it.

Failure mode 4: no post-response architecture change

Some teams survive one regulator cycle by heroics, then do nothing. The next inquiry arrives and they repeat the scramble at higher cost.

What a defensible AI-agent evidence layer should include

If you want to avoid repeating this fire drill, design for evidence as a first-class system.

A defensible stack usually includes:

  1. Cross-provider normalization across OpenAI, Anthropic, and internal agent frameworks so one query spans all execution paths.
  2. Ingestion-time signing so integrity is anchored at capture, not asserted later.
  3. Append-only storage semantics with deletion detection, often via hash-linked records or Merkle structures.
  4. Trusted timestamping through RFC 3161-compatible authorities for stronger temporal claims.
  5. Retention by obligation (regulatory, contractual, and legal hold), not just storage cost targets.
  6. Framework-mapped exports for EU AI Act, SOC 2, HIPAA Security Rule 164.312(b), NIST AI RMF, and ISO 42001.
  7. Verification workflows that outside counsel or auditors can execute independently.

Notice what is not on this list: prettier dashboards. This is an evidence problem, not a visualization problem.
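The deletion detection in item 3 is often built with a Merkle tree over per-record hashes: the whole collection folds into one root, so removing or altering any record changes the root. A minimal sketch, one of several valid constructions:

```python
import hashlib


def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    """Fold a list of record hashes into one root hash.

    Any change to the leaf set, including a silent deletion,
    produces a different root.
    """
    if not leaf_hashes:
        return hashlib.sha256(b"").digest()
    level = leaf_hashes
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

Publishing or externally timestamping the root periodically is what anchors the claim: you do not have to trust the storage layer, only the arithmetic.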

A concrete scenario: underwriting dispute response

Consider a lending workflow where an AI agent prepares recommendation memos for underwriters.

A regulator asks for all records tied to declined applicants in a specific period, including model inputs, external data retrieved, tool calls, and final recommendation text.

If your data is fragmented, you may produce:

  • gateway logs with request IDs,
  • model traces without business context,
  • final decisions in a separate core system,
  • and no consistent event key across all three.

Now imagine the same request with an evidence layer in place. You query by applicant and date window, export a single signed timeline containing input context, provider/model metadata, tool interactions, and decision outputs, then attach custody and integrity proofs. The legal question shifts from reconstruction to explanation, which is exactly where you want to be.

Metrics that tell you whether you are ready

Track readiness like any other control domain. Suggested metrics:

  • Mean time to produce a complete agent timeline for one case.
  • Percentage of agent events with full required schema fields.
  • Percentage of records with verifiable signatures.
  • Coverage of retention policies versus legal obligations.
  • Number of unresolved chronology gaps per quarterly tabletop.

If you cannot report these, you do not have a reliable baseline.
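The schema-coverage metric, for instance, falls directly out of the canonical events once they exist. A sketch, with the required-field set assumed to come from your frozen schema:

```python
def field_coverage(events: list[dict], required: set[str]) -> float:
    """Percentage of events carrying every required schema field."""
    if not events:
        return 0.0
    complete = sum(1 for e in events if required <= set(e))
    return 100.0 * complete / len(events)
```

Reporting this number monthly, per workflow, is usually enough to show whether coverage is trending toward or away from defensibility.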

What to do Monday morning

Run a two-hour tabletop on your highest-risk agent workflow with legal, security, and platform engineering in one room.

Use this prompt: “If a regulator asked for this workflow’s last 90 days of actions by noon tomorrow, what exactly would we deliver?”

Score your output from 0 to 2 across five categories:

  • completeness,
  • chronology,
  • integrity,
  • custody,
  • export usability.

A score below 8 out of 10 means your organization is still relying on heroics.

Then assign one owner to close one gap this week, not ten gaps next quarter.

If you want a concrete model for building this evidence layer, review Notary’s docs and compare your current process to the structure of Notary evidence packs. The goal is straightforward: before the next regulator letter arrives, move from “we can’t answer” to “here is the signed record.”

Vendor and board questions you should be ready to answer

When your organization is under inquiry, external counsel, auditors, and board members tend to ask the same hard questions. Preparing clear answers now reduces escalation later.

Can you prove completeness for the requested scope?

You should be able to explain inclusion logic plainly: which systems were queried, which identifiers were used, and why those identifiers are sufficient to capture all relevant records. If there are known exclusions, state them and quantify impact.

Can an independent reviewer verify integrity without trusting your internal team?

A credible answer includes verification steps, keys or certificates, and reproducible checks. If verification depends only on “trust our admins,” expect pushback.

What changed after prior incidents or audits?

Regulators and boards look for institutional learning. Show concrete control improvements with dates, owners, and validation evidence.

How quickly can you repeat this process next month?

A one-time heroic response is weaker than a repeatable operating model. Demonstrate runbooks, tested exports, and periodic tabletop results.

Implementation roadmap: from scramble to repeatable control

A realistic 90-day roadmap usually beats a multi-year transformation plan.

Days 1–30: Stabilize capture and scope discipline

  • Lock canonical schema for high-risk workflows.
  • Close obvious retention gaps.
  • Establish legal-hold trigger and ownership.
  • Run one weekly reconstruction drill.

Days 31–60: Add integrity guarantees

  • Introduce signing at ingestion or nearest trusted point.
  • Add hash manifest generation for all exports.
  • Validate timestamp strategy, including trusted authority integration where feasible.
  • Document verification procedure for counsel.

Days 61–90: Operationalize exports and governance

  • Build framework-specific export templates.
  • Add executive-level readiness metrics to monthly risk review.
  • Run cross-functional tabletop with GC, CISO, and platform leads.
  • Publish a living runbook with version history.

This roadmap is intentionally practical. The goal is not theoretical perfection. The goal is to ensure the next regulator request is processed as a controlled workflow, not a crisis.