Agentic systems fail differently from traditional software. A web service usually fails at a known boundary: an endpoint, a queue, a database call, a deployment. An agent can fail earlier, inside the invisible space between user intent and execution. It can misunderstand the task, choose the wrong tool, pass unsafe parameters, ignore a policy, ask for approval too late, or produce a final answer that hides the messy path it took to get there.
That is why agent observability cannot stop at logs. It needs to preserve the full causal chain:
intent -> plan -> tool calls -> policy checks -> approvals -> outcomes
When those pieces are connected by a single trace, an agent stops feeling like a black box. It becomes a system you can inspect, debug, govern, and improve.
Why Standard Logs Are Not Enough
A normal application log can tell you that a request arrived and a function executed. For agents, that is only the outer shell. The important questions are usually semantic:
- What did the agent think the user wanted?
- Which plan step caused a tool call?
- Was the tool call based on retrieved context, model reasoning, or stale memory?
- Which policy allowed or denied the action?
- Did a human approve the risky step?
- Did the final outcome match the original intent?
If these answers live in separate logs, you will spend every incident reconstructing the story by hand. The observability stack should capture the story as the system runs.
The Trace Envelope
Every agent run should start with a trace envelope. This is the durable object that links every downstream event.
{
  "trace_id": "trc_20260603_8f42",
  "session_id": "ses_gilroy_lab_19",
  "actor": {
    "type": "human",
    "id": "user_123"
  },
  "agent": {
    "name": "WorkspaceAgent",
    "version": "2026.06.03"
  },
  "intent": {
    "raw": "Review the repo and create a draft PR for the auth fix.",
    "classified": "CODE_REVIEW_AND_CHANGESET",
    "risk": "medium"
  },
  "started_at": "2026-06-03T15:42:10.120Z"
}
This envelope is not just metadata. It is the root of accountability. Every plan step, model response, tool call, policy decision, approval prompt, and outcome should carry the same trace_id.
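One way to make "every event carries the same trace_id" hard to get wrong is to create events only through a helper bound to the envelope. The following is a minimal TypeScript sketch under that assumption: `TraceEnvelope` mirrors the JSON fields shown above, while `AgentEvent`, `createTraceContext`, and `emit` are hypothetical names, and the in-memory array stands in for whatever collector the runtime actually uses.

```typescript
// Shape of the trace envelope shown above.
interface TraceEnvelope {
  trace_id: string;
  session_id: string;
  actor: { type: string; id: string };
  agent: { name: string; version: string };
  intent: { raw: string; classified: string; risk: string };
  started_at: string;
}

// Hypothetical base shape for every downstream event.
interface AgentEvent {
  trace_id: string;
  event_type: string;
  [field: string]: unknown;
}

// Hypothetical helper: binds an emitter to one envelope so callers
// never have to pass (or forget) the trace_id themselves.
function createTraceContext(envelope: TraceEnvelope, sink: AgentEvent[]) {
  return {
    emit(eventType: string, fields: Record<string, unknown> = {}): AgentEvent {
      // trace_id and event_type are written last so `fields` cannot override them.
      const event: AgentEvent = {
        ...fields,
        trace_id: envelope.trace_id,
        event_type: eventType,
      };
      sink.push(event);
      return event;
    },
  };
}

const envelope: TraceEnvelope = {
  trace_id: "trc_20260603_8f42",
  session_id: "ses_gilroy_lab_19",
  actor: { type: "human", id: "user_123" },
  agent: { name: "WorkspaceAgent", version: "2026.06.03" },
  intent: {
    raw: "Review the repo and create a draft PR for the auth fix.",
    classified: "CODE_REVIEW_AND_CHANGESET",
    risk: "medium",
  },
  started_at: "2026-06-03T15:42:10.120Z",
};

const collected: AgentEvent[] = [];
const ctx = createTraceContext(envelope, collected);
ctx.emit("plan.created", { plan_id: "pln_91b7" });
ctx.emit("tool.called", { tool_call_id: "tool_5cc1", parent_step_id: "step_1" });
```

Because the helper owns the trace_id, a stray event with a missing or mismatched identifier becomes a code-review problem rather than an incident-time surprise.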
From Intent To Plan
The first useful observability event is the agent's interpretation of the user request. This should be stored separately from the raw prompt because the interpretation is often where failures begin.
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "plan.created",
  "plan_id": "pln_91b7",
  "summary": "Inspect auth code, identify regression, patch minimal files, run tests, prepare PR.",
  "steps": [
    {
      "id": "step_1",
      "kind": "inspect",
      "target": "auth provider and route guards"
    },
    {
      "id": "step_2",
      "kind": "edit",
      "target": "minimal auth fix"
    },
    {
      "id": "step_3",
      "kind": "verify",
      "target": "unit and type checks"
    }
  ]
}
This gives you a replayable contract. If the agent later edits an unrelated payment module, you can see whether the plan drifted or whether a tool call escaped the plan boundary.
Tool Calls Need Parentage
Tool observability is more than recording "tool X ran." Each tool call should point back to the plan step that triggered it.
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "tool.called",
  "tool_call_id": "tool_5cc1",
  "parent_step_id": "step_1",
  "tool": "repo.search",
  "input": {
    "query": "AuthProvider",
    "scope": "src"
  },
  "started_at": "2026-06-03T15:42:15.018Z"
}
The parent step is what turns a tool log into an agent trace. Without it, you can see what happened but not why it happened.
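Parentage also makes plan-boundary escapes mechanically detectable. As a rough sketch, assuming the `plan.created` and `tool.called` shapes shown above, the hypothetical function `findUnparentedToolCalls` flags any tool call whose `parent_step_id` is missing or names a step that was never planned. The second tool call in the sample data (`tool_9d20`, `payments.update`, `step_7`) is invented for illustration.

```typescript
interface PlanStep {
  id: string;
  kind: string;
  target: string;
}

interface PlanCreated {
  event_type: "plan.created";
  plan_id: string;
  steps: PlanStep[];
}

interface ToolCalled {
  event_type: "tool.called";
  tool_call_id: string;
  parent_step_id?: string;
  tool: string;
}

// Returns tool calls that cannot be tied to a step of the plan:
// parent_step_id is either absent or refers to an unknown step.
function findUnparentedToolCalls(plan: PlanCreated, calls: ToolCalled[]): ToolCalled[] {
  const plannedStepIds = new Set(plan.steps.map((step) => step.id));
  return calls.filter(
    (call) =>
      call.parent_step_id === undefined || !plannedStepIds.has(call.parent_step_id),
  );
}

const plan: PlanCreated = {
  event_type: "plan.created",
  plan_id: "pln_91b7",
  steps: [
    { id: "step_1", kind: "inspect", target: "auth provider and route guards" },
    { id: "step_2", kind: "edit", target: "minimal auth fix" },
    { id: "step_3", kind: "verify", target: "unit and type checks" },
  ],
};

const toolCalls: ToolCalled[] = [
  { event_type: "tool.called", tool_call_id: "tool_5cc1", parent_step_id: "step_1", tool: "repo.search" },
  // Invented example of a call that escaped the plan boundary.
  { event_type: "tool.called", tool_call_id: "tool_9d20", parent_step_id: "step_7", tool: "payments.update" },
];

const escaped = findUnparentedToolCalls(plan, toolCalls);
```

A check like this could run at trace ingestion time, so that an unparented call raises an alert while the run is still in progress.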
Policy Checks Are First-Class Events
The most important observability signal in an agentic system is often not model output. It is the policy decision that sits between the model and the world.
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "policy.evaluated",
  "policy_id": "workspace.write_guard.v3",
  "subject": "WorkspaceAgent",
  "action": "file.write",
  "resource": "src/auth/AuthProvider.tsx",
  "decision": "allow",
  "reason": "Path is inside approved workspace and extension is allowed.",
  "risk": "medium"
}
For denied actions, keep the denial reason precise. "Blocked by policy" is not enough. Engineers need to know whether the issue was path traversal, risky content, missing approval, network exfiltration, or scope mismatch.
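Precise denial reasons are easier to enforce when they are part of the type. The sketch below is one possible shape, not the behavior of any real policy engine: the `DenialCode` strings are illustrative renderings of the categories listed above, `extension_not_allowed` is an extra code added for this example, and `evaluateWriteGuard` is a hypothetical, deliberately simplified write guard.

```typescript
// Codes derived from the denial categories named above, plus one
// (extension_not_allowed) added only for this sketch.
type DenialCode =
  | "path_traversal"
  | "risky_content"
  | "missing_approval"
  | "network_exfiltration"
  | "scope_mismatch"
  | "extension_not_allowed";

// A deny decision cannot be constructed without a machine-readable code.
type PolicyDecision =
  | { decision: "allow"; reason: string }
  | { decision: "deny"; code: DenialCode; reason: string };

// Hypothetical write guard: relative paths only, inside one approved
// root directory, with an allow-list of file extensions.
function evaluateWriteGuard(
  resource: string,
  allowedRoot: string,
  allowedExtensions: string[],
): PolicyDecision {
  const segments = resource.split("/");
  if (resource.startsWith("/") || segments.includes("..")) {
    return {
      decision: "deny",
      code: "path_traversal",
      reason: `Path "${resource}" is absolute or contains "..".`,
    };
  }
  if (!resource.startsWith(allowedRoot + "/")) {
    return {
      decision: "deny",
      code: "scope_mismatch",
      reason: `Path "${resource}" is outside approved root "${allowedRoot}".`,
    };
  }
  const fileName = segments[segments.length - 1] ?? "";
  const dot = fileName.lastIndexOf(".");
  const extension = dot === -1 ? "" : fileName.slice(dot);
  if (!allowedExtensions.includes(extension)) {
    return {
      decision: "deny",
      code: "extension_not_allowed",
      reason: `Extension "${extension}" is not in the allow-list.`,
    };
  }
  return {
    decision: "allow",
    reason: "Path is inside approved workspace and extension is allowed.",
  };
}

const allowed = evaluateWriteGuard("src/auth/AuthProvider.tsx", "src", [".ts", ".tsx"]);
const traversal = evaluateWriteGuard("src/../../etc/passwd", "src", [".ts", ".tsx"]);
```

Keeping the free-text `reason` alongside the code gives engineers a readable explanation while dashboards can still group denials by `code`.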
Human Approval Should Be Observable
Human-in-the-loop approval is not just a UX pattern. It is an audit event.
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "approval.resolved",
  "approval_id": "apr_7af3",
  "requested_action": "file.write",
  "resource": "src/auth/AuthProvider.tsx",
  "risk": "medium",
  "decision": "approved",
  "approved_by": "user_123",
  "expires_at": "2026-06-03T16:12:20.000Z"
}
Approvals should include expiration. A user approving one write now should not silently grant the agent a reusable permission for the rest of the day.
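A runtime can apply that rule by checking scope and expiry every time an approval is consulted. The following is a small sketch assuming the `approval.resolved` fields shown above; `approvalCovers` is a hypothetical helper, and `"rejected"` is an assumed value for the non-approved case.

```typescript
interface ApprovalResolved {
  approval_id: string;
  requested_action: string;
  resource: string;
  decision: "approved" | "rejected"; // "rejected" is an assumed value
  approved_by: string;
  expires_at: string; // ISO 8601 timestamp
}

// An approval covers exactly one action on one resource, and only
// until its expiry time. Anything else needs a fresh approval.
function approvalCovers(
  approval: ApprovalResolved,
  action: string,
  resource: string,
  now: Date,
): boolean {
  return (
    approval.decision === "approved" &&
    approval.requested_action === action &&
    approval.resource === resource &&
    now.getTime() < Date.parse(approval.expires_at)
  );
}

const approval: ApprovalResolved = {
  approval_id: "apr_7af3",
  requested_action: "file.write",
  resource: "src/auth/AuthProvider.tsx",
  decision: "approved",
  approved_by: "user_123",
  expires_at: "2026-06-03T16:12:20.000Z",
};

// Within the window and same resource: covered.
const stillValid = approvalCovers(
  approval, "file.write", "src/auth/AuthProvider.tsx", new Date("2026-06-03T15:45:00.000Z"),
);
// After expires_at: not covered, even for the same file.
const expired = approvalCovers(
  approval, "file.write", "src/auth/AuthProvider.tsx", new Date("2026-06-03T16:30:00.000Z"),
);
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic when a trace is replayed later.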
Replay The Failure
The fastest way to improve an agent is to replay a failed run with the original inputs, decisions, and tool outputs. A good trace should make that possible without re-granting dangerous permissions.
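type ReplayMode = "dry-run" | "policy-only" | "full";
// traces, replay, and policies are services supplied by the agent runtime.
async function replayTrace(traceId: string, mode: ReplayMode): Promise<void> {
  const trace = await traces.load(traceId);
  for (const event of trace.events) {
    // Events carry their name in event_type, matching the JSON examples above.
    if (event.event_type === "tool.called" && mode !== "full") {
      // In the safe modes, record a simulated result instead of running the tool.
      await replay.recordSimulatedToolResult(event.tool_call_id);
      continue;
    }
    if (event.event_type === "policy.evaluated") {
      // Re-run the policy decision so changes in policy behavior surface.
      await policies.reEvaluate(event, { mode });
    }
  }
}
Most production replays should start in dry-run or policy-only mode. You want to debug reasoning, routing, and policy behavior before allowing any real side effects.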
What To Instrument
If you are building an agent runtime, instrument these events first:
| Layer | Event | Why it matters |
|---|---|---|
| Intent | intent.classified | Shows how the agent understood the user |
| Planning | plan.created | Creates a contract for later actions |
| Retrieval | context.loaded | Reveals what context influenced the model |
| Tools | tool.called and tool.completed | Links execution to plan steps |
| Policy | policy.evaluated | Explains allow, deny, and escalation decisions |
| Approval | approval.requested and approval.resolved | Captures human control points |
| Outcome | outcome.completed | Compares the final result to the original intent |
The goal is not to collect everything. The goal is to collect the smallest set of events that lets an engineer answer: what happened, why did it happen, who allowed it, and did it work?
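A trace can be checked against that minimal set automatically. The sketch below assumes the event names from the table and treats `intent.classified`, `plan.created`, and `outcome.completed` as required for every run; that choice, along with the `findTraceGaps` function and the `tool_9d20` sample call, is an assumption made for illustration.

```typescript
interface LoggedEvent {
  event_type: string;
  tool_call_id?: string;
}

// Assumed to be required in every run; retrieval, policy, and approval
// events are treated as optional because not every run needs them.
const REQUIRED_EVENTS = ["intent.classified", "plan.created", "outcome.completed"];

// Lists the gaps that would stop an engineer from reconstructing a run.
function findTraceGaps(events: LoggedEvent[]): string[] {
  const gaps: string[] = [];
  const seenTypes = new Set(events.map((event) => event.event_type));
  for (const required of REQUIRED_EVENTS) {
    if (!seenTypes.has(required)) {
      gaps.push(`missing ${required}`);
    }
  }
  // Every tool.called should be closed by a tool.completed with the same id.
  const completedIds = new Set(
    events
      .filter((event) => event.event_type === "tool.completed")
      .map((event) => event.tool_call_id),
  );
  for (const event of events) {
    if (event.event_type === "tool.called" && !completedIds.has(event.tool_call_id)) {
      gaps.push(`tool call ${event.tool_call_id ?? "(no id)"} never completed`);
    }
  }
  return gaps;
}

const gaps = findTraceGaps([
  { event_type: "intent.classified" },
  { event_type: "plan.created" },
  { event_type: "tool.called", tool_call_id: "tool_5cc1" },
  { event_type: "tool.completed", tool_call_id: "tool_5cc1" },
  { event_type: "tool.called", tool_call_id: "tool_9d20" },
]);
// gaps: ["missing outcome.completed", "tool call tool_9d20 never completed"]
```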
A Minimal Architecture
The stack can be simple:
- Trace Collector: Receives structured events from the agent runtime.
- Policy Logger: Records every allow, deny, and escalation decision.
- Tool Gateway: Wraps tool calls so inputs, outputs, latency, and errors are captured.
- Approval Service: Stores human decisions with scope and expiration.
- Replay Runner: Reconstructs past executions in safe modes.
- Trace UI: Lets engineers inspect a run as a timeline, graph, and audit log.
This architecture keeps observability close to the execution path. The tool gateway and policy layer become the natural places to emit events because every meaningful side effect already passes through them.
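To illustrate why the gateway is a natural emission point, here is a minimal sketch of a wrapper that records inputs, outputs, latency, and errors around a tool call. All names (`GatewayEvent`, `callThroughGateway`, the `emit` callback) are hypothetical, and the wrapper is synchronous to keep the example short; a real gateway would most likely wrap asynchronous tools.

```typescript
interface GatewayEvent {
  trace_id: string;
  event_type: "tool.called" | "tool.completed";
  tool_call_id: string;
  parent_step_id: string;
  tool: string;
  input?: unknown;
  output?: unknown;
  error?: string;
  latency_ms?: number;
}

interface ToolCallMeta {
  trace_id: string;
  tool_call_id: string;
  parent_step_id: string;
  tool: string;
}

// Hypothetical gateway: emits tool.called before the tool runs and
// tool.completed afterwards, whether the tool returned or threw.
function callThroughGateway<I, O>(
  emit: (event: GatewayEvent) => void,
  meta: ToolCallMeta,
  toolFn: (input: I) => O,
  input: I,
): O {
  emit({ ...meta, event_type: "tool.called", input });
  const startedAt = Date.now();
  try {
    const output = toolFn(input);
    emit({ ...meta, event_type: "tool.completed", output, latency_ms: Date.now() - startedAt });
    return output;
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    emit({ ...meta, event_type: "tool.completed", error, latency_ms: Date.now() - startedAt });
    throw err; // the caller still sees the original failure
  }
}

const emitted: GatewayEvent[] = [];
const searchHits = callThroughGateway(
  (event) => emitted.push(event),
  { trace_id: "trc_20260603_8f42", tool_call_id: "tool_5cc1", parent_step_id: "step_1", tool: "repo.search" },
  // Stand-in for a real search tool.
  (input: { query: string; scope: string }) => [`${input.scope}/auth/${input.query}.tsx`],
  { query: "AuthProvider", scope: "src" },
);
```

Because the tool itself never emits events, adding a new tool does not add a new place where instrumentation can be forgotten.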
The Outcome Event
At the end of a run, close the trace with an outcome event. This is where the system records whether the agent actually satisfied the original request.
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "outcome.completed",
  "status": "success",
  "summary": "Auth regression fixed, tests passed, draft PR prepared.",
  "artifacts": [
    "diff_18a2",
    "test_run_77f0",
    "draft_pr_31"
  ],
  "matched_intent": true,
  "completed_at": "2026-06-03T15:51:44.912Z"
}
This final event is useful for product analytics as well as debugging. Over time, you can measure which intents succeed, which tools fail, which policies produce frequent denials, and where humans are repeatedly pulled into the loop.
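As a sketch of the first of those measurements, the function below computes a success rate per classified intent. It assumes each run has been reduced to a `RunSummary` that joins the envelope's classified intent with the outcome event's `status` and `matched_intent`; the `RunSummary` type, the `"failure"` status value, the `DOC_LOOKUP` intent, and the sample rows are all made up for illustration.

```typescript
// Hypothetical per-run summary joining envelope and outcome fields.
interface RunSummary {
  classified_intent: string;
  status: "success" | "failure"; // "failure" is an assumed value
  matched_intent: boolean;
}

// Fraction of runs per intent that both succeeded and matched the intent.
function successRateByIntent(runs: RunSummary[]): Map<string, number> {
  const totals = new Map<string, { ok: number; all: number }>();
  for (const run of runs) {
    const entry = totals.get(run.classified_intent) ?? { ok: 0, all: 0 };
    entry.all += 1;
    if (run.status === "success" && run.matched_intent) {
      entry.ok += 1;
    }
    totals.set(run.classified_intent, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach((entry, intent) => {
    rates.set(intent, entry.ok / entry.all);
  });
  return rates;
}

// Invented sample data.
const rates = successRateByIntent([
  { classified_intent: "CODE_REVIEW_AND_CHANGESET", status: "success", matched_intent: true },
  { classified_intent: "CODE_REVIEW_AND_CHANGESET", status: "failure", matched_intent: false },
  { classified_intent: "DOC_LOOKUP", status: "success", matched_intent: true },
]);
// rates: CODE_REVIEW_AND_CHANGESET -> 0.5, DOC_LOOKUP -> 1
```

Counting a run as successful only when `matched_intent` is also true keeps runs that finished cleanly but solved the wrong problem out of the success column.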
Conclusion
Agent observability is the discipline of making autonomy inspectable.
The core idea is simple: every meaningful agent action should be connected to the intent that caused it, the plan that justified it, the policy that governed it, and the outcome it produced. Once that chain exists, you can debug agents like real systems instead of reading scattered logs and guessing what the model meant to do.
Trust does not come from making agents sound confident. It comes from making their execution legible.
