
The Agent Observability Stack: Tracing Intent, Tools, Policy, and Outcomes

Agents become trustworthy when every intent, plan, tool call, policy decision, approval, and outcome can be inspected as one coherent trace.

Agentic systems fail differently from traditional software. A web service usually fails at a known boundary: an endpoint, a queue, a database call, a deployment. An agent can fail earlier, inside the invisible space between user intent and execution. It can misunderstand the task, choose the wrong tool, pass unsafe parameters, ignore a policy, ask for approval too late, or produce a final answer that hides the messy path it took to get there.

That is why agent observability cannot stop at logs. It needs to preserve the full causal chain:

intent -> plan -> tool calls -> policy checks -> approvals -> outcomes

When those pieces are connected by a single trace, an agent stops feeling like a black box. It becomes a system you can inspect, debug, govern, and improve.

Why Standard Logs Are Not Enough

A normal application log can tell you that a request arrived and a function executed. For agents, that is only the outer shell. The important questions are usually semantic:

  • What did the agent think the user wanted?
  • Which plan step caused a tool call?
  • Was the tool call based on retrieved context, model reasoning, or stale memory?
  • Which policy allowed or denied the action?
  • Did a human approve the risky step?
  • Did the final outcome match the original intent?

If these answers live in separate logs, you will spend every incident reconstructing the story by hand. The observability stack should capture the story as the system runs.

The Trace Envelope

Every agent run should start with a trace envelope. This is the durable object that links every downstream event.

json.SNIPPET
{
  "trace_id": "trc_20260603_8f42",
  "session_id": "ses_gilroy_lab_19",
  "actor": { "type": "human", "id": "user_123" },
  "agent": { "name": "WorkspaceAgent", "version": "2026.06.03" },
  "intent": {
    "raw": "Review the repo and create a draft PR for the auth fix.",
    "classified": "CODE_REVIEW_AND_CHANGESET",
    "risk": "medium"
  },
  "started_at": "2026-06-03T15:42:10.120Z"
}

This envelope is not just metadata. It is the root of accountability. Every plan step, model response, tool call, policy decision, approval prompt, and outcome should carry the same trace_id.
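One way to enforce that rule is to let the runtime create events only through a helper bound to the envelope, so nothing can be emitted without the right trace_id. The following is a minimal sketch under assumptions: `TraceEnvelope`, `AgentEvent`, and `createEmitter` are hypothetical names, and the in-memory array stands in for whatever event sink a real runtime would use.

```typescript
// Hypothetical types for illustration. Field names mirror the envelope above;
// the "service" actor type and the "low"/"high" risk levels are assumptions.
interface TraceEnvelope {
  trace_id: string;
  session_id: string;
  actor: { type: "human" | "service"; id: string };
  agent: { name: string; version: string };
  intent: { raw: string; classified: string; risk: "low" | "medium" | "high" };
  started_at: string;
}

interface AgentEvent {
  trace_id: string;
  event_type: string;
  emitted_at: string;
  [field: string]: unknown;
}

// Returns an emit function bound to one envelope, so every event it
// produces carries that envelope's trace_id.
function createEmitter(envelope: TraceEnvelope, sink: AgentEvent[]) {
  return function emit(eventType: string, fields: Record<string, unknown> = {}): AgentEvent {
    const event: AgentEvent = {
      ...fields,
      // Assigned after the spread so caller-supplied fields cannot override them.
      trace_id: envelope.trace_id,
      event_type: eventType,
      emitted_at: new Date().toISOString(),
    };
    sink.push(event);
    return event;
  };
}

const envelope: TraceEnvelope = {
  trace_id: "trc_20260603_8f42",
  session_id: "ses_gilroy_lab_19",
  actor: { type: "human", id: "user_123" },
  agent: { name: "WorkspaceAgent", version: "2026.06.03" },
  intent: {
    raw: "Review the repo and create a draft PR for the auth fix.",
    classified: "CODE_REVIEW_AND_CHANGESET",
    risk: "medium",
  },
  started_at: "2026-06-03T15:42:10.120Z",
};

const sink: AgentEvent[] = [];
const emit = createEmitter(envelope, sink);
emit("plan.created", { plan_id: "pln_91b7" });
// Even a caller that passes its own trace_id ends up on the envelope's trace.
emit("tool.called", { tool_call_id: "tool_5cc1", parent_step_id: "step_1", trace_id: "spoofed" });
```

Binding the emitter to the envelope makes the trace_id a property of the run rather than something each call site has to remember.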

From Intent To Plan

The first useful observability event is the agent's interpretation of the user request. This should be stored separately from the raw prompt because the interpretation is often where failures begin.

json.SNIPPET
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "plan.created",
  "plan_id": "pln_91b7",
  "summary": "Inspect auth code, identify regression, patch minimal files, run tests, prepare PR.",
  "steps": [
    { "id": "step_1", "kind": "inspect", "target": "auth provider and route guards" },
    { "id": "step_2", "kind": "edit", "target": "minimal auth fix" },
    { "id": "step_3", "kind": "verify", "target": "unit and type checks" }
  ]
}

This gives you a replayable contract. If the agent later edits an unrelated payment module, you can see whether the plan drifted or whether a tool call escaped the plan boundary.

Tool Calls Need Parentage

Tool observability is more than recording "tool X ran." Each tool call should point back to the plan step that triggered it.

json.SNIPPET
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "tool.called",
  "tool_call_id": "tool_5cc1",
  "parent_step_id": "step_1",
  "tool": "repo.search",
  "input": { "query": "AuthProvider", "scope": "src" },
  "started_at": "2026-06-03T15:42:15.018Z"
}

The parent step is what turns a tool log into an agent trace. Without it, you can see what happened but not why it happened.
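Parentage also makes plan drift mechanically detectable. The sketch below groups tool calls under the plan step that triggered them and collects any call whose parent is missing or was never declared in the plan. The function name `linkToolCallsToPlan` and the second tool call in the sample data are hypothetical, added only for illustration.

```typescript
interface PlanStep {
  id: string;
  kind: string;
  target: string;
}

interface PlanCreated {
  event_type: "plan.created";
  plan_id: string;
  steps: PlanStep[];
}

interface ToolCalled {
  event_type: "tool.called";
  tool_call_id: string;
  parent_step_id?: string;
  tool: string;
}

// Groups tool calls by plan step and separates out "orphans": calls with no
// parent, or with a parent step the plan never declared.
function linkToolCallsToPlan(plan: PlanCreated, calls: ToolCalled[]) {
  const byStep = new Map<string, ToolCalled[]>();
  for (const step of plan.steps) byStep.set(step.id, []);

  const orphans: ToolCalled[] = [];
  for (const call of calls) {
    const bucket = call.parent_step_id ? byStep.get(call.parent_step_id) : undefined;
    if (bucket) bucket.push(call);
    else orphans.push(call);
  }
  return { byStep, orphans };
}

const plan: PlanCreated = {
  event_type: "plan.created",
  plan_id: "pln_91b7",
  steps: [
    { id: "step_1", kind: "inspect", target: "auth provider and route guards" },
    { id: "step_2", kind: "edit", target: "minimal auth fix" },
    { id: "step_3", kind: "verify", target: "unit and type checks" },
  ],
};

const calls: ToolCalled[] = [
  { event_type: "tool.called", tool_call_id: "tool_5cc1", parent_step_id: "step_1", tool: "repo.search" },
  // Hypothetical call that references a step the plan never declared.
  { event_type: "tool.called", tool_call_id: "tool_9d20", parent_step_id: "step_9", tool: "file.write" },
];

const linked = linkToolCallsToPlan(plan, calls);
// Logs the IDs of calls that escaped the plan boundary.
console.log(linked.orphans.map((call) => call.tool_call_id));
```

An orphan is not automatically a bug, but it is the first place to look when an agent touches something the plan never mentioned.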

Policy Checks Are First-Class Events

The most important observability signal in an agentic system is often not model output. It is the policy decision that sits between the model and the world.

json.SNIPPET
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "policy.evaluated",
  "policy_id": "workspace.write_guard.v3",
  "subject": "WorkspaceAgent",
  "action": "file.write",
  "resource": "src/auth/AuthProvider.tsx",
  "decision": "allow",
  "reason": "Path is inside approved workspace and extension is allowed.",
  "risk": "medium"
}

For denied actions, keep the denial reason precise. "Blocked by policy" is not enough. Engineers need to know whether the issue was path traversal, risky content, missing approval, network exfiltration, or scope mismatch.
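A machine-readable reason code next to the human-readable reason is one way to keep denials precise. The sketch below is an illustrative write guard, not the actual logic of workspace.write_guard.v3: the `DenialCode` names, the `evaluateFileWrite` function, and the specific checks are all assumptions.

```typescript
// Hypothetical reason codes. The categories echo the text above, but the
// exact names and the set of checks are assumptions.
type DenialCode = "path_traversal" | "scope_mismatch" | "extension_not_allowed";

type PolicyDecision =
  | { decision: "allow"; reason: string }
  | { decision: "deny"; code: DenialCode; reason: string };

// Evaluates a file.write request against a workspace root and an
// extension allowlist, returning a specific reason for every denial.
function evaluateFileWrite(
  resource: string,
  workspaceRoot: string,
  allowedExtensions: string[],
): PolicyDecision {
  const segments = resource.split("/");
  if (segments.includes("..")) {
    return { decision: "deny", code: "path_traversal", reason: `Path "${resource}" contains a ".." segment.` };
  }
  if (!resource.startsWith(workspaceRoot + "/")) {
    return { decision: "deny", code: "scope_mismatch", reason: `Path "${resource}" is outside "${workspaceRoot}".` };
  }
  // Take the extension from the file name only, so dots in directory names are ignored.
  const fileName = segments[segments.length - 1] ?? "";
  const dot = fileName.lastIndexOf(".");
  const extension = dot === -1 ? "" : fileName.slice(dot);
  if (!allowedExtensions.includes(extension)) {
    return { decision: "deny", code: "extension_not_allowed", reason: `Extension "${extension}" is not on the allowlist.` };
  }
  return { decision: "allow", reason: "Path is inside approved workspace and extension is allowed." };
}

const allowed = [".ts", ".tsx"];
const decisions = [
  evaluateFileWrite("src/auth/AuthProvider.tsx", "src", allowed),
  evaluateFileWrite("src/../secrets/key.ts", "src", allowed),
  evaluateFileWrite("infra/deploy.ts", "src", allowed),
  evaluateFileWrite("src/auth/key.pem", "src", allowed),
];
for (const d of decisions) console.log(d.decision, d.reason);
```

With a code on every denial, dashboards can count denials by cause while the free-text reason stays available for the engineer reading a single trace.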

Human Approval Should Be Observable

Human-in-the-loop approval is not just a UX pattern. It is an audit event.

json.SNIPPET
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "approval.resolved",
  "approval_id": "apr_7af3",
  "requested_action": "file.write",
  "resource": "src/auth/AuthProvider.tsx",
  "risk": "medium",
  "decision": "approved",
  "approved_by": "user_123",
  "expires_at": "2026-06-03T16:12:20.000Z"
}

Approvals should include expiration. A user approving one write now should not silently grant the agent a reusable permission for the rest of the day.
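In code, that means an approval is checked against the exact action, the exact resource, and the clock every time it is used. The following sketch assumes a hypothetical `approvalCovers` helper and a "rejected" decision value that the example event above does not show.

```typescript
interface Approval {
  approval_id: string;
  requested_action: string;
  resource: string;
  decision: "approved" | "rejected"; // "rejected" is an assumed counterpart to "approved"
  approved_by: string;
  expires_at: string; // ISO 8601 timestamp
}

// An approval authorizes one action on one resource, and only until it expires.
function approvalCovers(approval: Approval, action: string, resource: string, now: Date): boolean {
  return (
    approval.decision === "approved" &&
    approval.requested_action === action &&
    approval.resource === resource &&
    now.getTime() < Date.parse(approval.expires_at)
  );
}

const approval: Approval = {
  approval_id: "apr_7af3",
  requested_action: "file.write",
  resource: "src/auth/AuthProvider.tsx",
  decision: "approved",
  approved_by: "user_123",
  expires_at: "2026-06-03T16:12:20.000Z",
};

const duringWindow = new Date("2026-06-03T15:45:00.000Z");
const afterExpiry = new Date("2026-06-03T17:00:00.000Z");

const sameWriteInWindow = approvalCovers(approval, "file.write", "src/auth/AuthProvider.tsx", duringWindow);
const sameWriteAfterExpiry = approvalCovers(approval, "file.write", "src/auth/AuthProvider.tsx", afterExpiry);
const otherFileInWindow = approvalCovers(approval, "file.write", "src/payments/Checkout.tsx", duringWindow);
console.log({ sameWriteInWindow, sameWriteAfterExpiry, otherFileInWindow });
```

Passing the current time in as a parameter keeps the check deterministic, which also makes it easy to re-run during replay.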

Replay The Failure

The fastest way to improve an agent is to replay a failed run with the original inputs, decisions, and tool outputs. A good trace should make that possible without re-granting dangerous permissions.

TS.SNIPPET
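// traces, replay, and policies are services supplied by the agent runtime (not shown here).
type ReplayMode = "dry-run" | "policy-only" | "full";

async function replayTrace(traceId: string, mode: ReplayMode) {
  const trace = await traces.load(traceId);

  for (const event of trace.events) {
    // Outside "full" mode, never re-execute a tool: record a simulated result instead.
    if (event.event_type === "tool.called" && mode !== "full") {
      await replay.recordSimulatedToolResult(event.tool_call_id);
      continue;
    }

    // Re-run policy decisions so changes in policy behavior show up in the replay.
    if (event.event_type === "policy.evaluated") {
      await policies.reEvaluate(event, { mode });
    }
    // In "full" mode a tool.called event falls through; real re-execution is not shown here.
  }
}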

Most production replay should start in dry-run or policy-only mode. You want to debug reasoning, routing, and policy behavior before allowing any real side effects.

What To Instrument

If you are building an agent runtime, instrument these events first:

Layer     | Event                                     | Why it matters
----------|-------------------------------------------|------------------------------------------------
Intent    | intent.classified                         | Shows how the agent understood the user
Planning  | plan.created                              | Creates a contract for later actions
Retrieval | context.loaded                            | Reveals what context influenced the model
Tools     | tool.called and tool.completed            | Links execution to plan steps
Policy    | policy.evaluated                          | Explains allow, deny, and escalation decisions
Approval  | approval.requested and approval.resolved  | Captures human control points
Outcome   | outcome.completed                         | Compares the final result to the original intent

The goal is not to collect everything. The goal is to collect the smallest set of events that lets an engineer answer: what happened, why did it happen, who allowed it, and did it work?
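A quick way to see how far a runtime is from that set is to report which of the event types above a trace actually contains. This is a rough sketch; `eventCoverage` is a hypothetical helper, and an absent type is not always a defect (a run with no risky step will legitimately have no approval events).

```typescript
// Event types taken from the instrumentation list above.
const INSTRUMENTED_EVENT_TYPES = [
  "intent.classified",
  "plan.created",
  "context.loaded",
  "tool.called",
  "tool.completed",
  "policy.evaluated",
  "approval.requested",
  "approval.resolved",
  "outcome.completed",
] as const;

// Splits the instrumented event types into those seen in a trace and those missing.
function eventCoverage(events: { event_type: string }[]) {
  const seen = new Set(events.map((event) => event.event_type));
  const present = INSTRUMENTED_EVENT_TYPES.filter((type) => seen.has(type));
  const absent = INSTRUMENTED_EVENT_TYPES.filter((type) => !seen.has(type));
  return { present, absent };
}

// A trace containing only the event types shown as examples in this article.
const coverage = eventCoverage([
  { event_type: "plan.created" },
  { event_type: "tool.called" },
  { event_type: "policy.evaluated" },
  { event_type: "approval.resolved" },
  { event_type: "outcome.completed" },
]);
console.log("missing:", coverage.absent.join(", "));
```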

A Minimal Architecture

The stack can be simple:

  1. Trace Collector: Receives structured events from the agent runtime.
  2. Policy Logger: Records every allow, deny, and escalation decision.
  3. Tool Gateway: Wraps tool calls so inputs, outputs, latency, and errors are captured.
  4. Approval Service: Stores human decisions with scope and expiration.
  5. Replay Runner: Reconstructs past executions in safe modes.
  6. Trace UI: Lets engineers inspect a run as a timeline, graph, and audit log.

This architecture keeps observability close to the execution path. The tool gateway and policy layer become the natural places to emit events because every meaningful side effect already passes through them.
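To make the tool gateway concrete, here is a minimal sketch of a wrapper that emits tool.called before a tool runs and tool.completed afterwards, including on failure. Everything in it is an assumption made for illustration: `wrapTool`, the `GatewayEvent` shape, the `latency_ms` and `error` fields, and the counter-based call ID, which stands in for a real unique ID.

```typescript
interface GatewayEvent {
  trace_id: string;
  event_type: "tool.called" | "tool.completed";
  tool_call_id: string;
  parent_step_id: string;
  tool: string;
  input?: unknown;
  output?: unknown;
  error?: string;
  latency_ms?: number;
}

type Tool<I, O> = (input: I) => Promise<O>;

// Wraps a tool so that every invocation is traced, whether it succeeds or throws.
function wrapTool<I, O>(name: string, tool: Tool<I, O>, emit: (event: GatewayEvent) => void) {
  let counter = 0; // Placeholder for a real unique ID generator.
  return async (ctx: { trace_id: string; parent_step_id: string }, input: I): Promise<O> => {
    const base = { ...ctx, tool_call_id: `${name}#${++counter}`, tool: name };
    emit({ ...base, event_type: "tool.called", input });
    const startedAt = Date.now();
    try {
      const output = await tool(input);
      emit({ ...base, event_type: "tool.completed", output, latency_ms: Date.now() - startedAt });
      return output;
    } catch (err) {
      emit({
        ...base,
        event_type: "tool.completed",
        error: err instanceof Error ? err.message : String(err),
        latency_ms: Date.now() - startedAt,
      });
      throw err; // The caller still sees the failure; the gateway only observes it.
    }
  };
}

async function demo(): Promise<GatewayEvent[]> {
  const events: GatewayEvent[] = [];
  // Stand-in for a real repo.search tool.
  const search = wrapTool(
    "repo.search",
    async (input: { query: string }) => {
      if (input.query === "") throw new Error("empty query");
      return [`src/auth/${input.query}.tsx`];
    },
    (event) => events.push(event),
  );

  const ctx = { trace_id: "trc_20260603_8f42", parent_step_id: "step_1" };
  await search(ctx, { query: "AuthProvider" });
  await search(ctx, { query: "" }).catch(() => undefined); // The failure is still traced.
  return events;
}

demo().then((events) => console.log(events.map((e) => `${e.event_type} ${e.tool_call_id}`)));
```

Because the wrapper owns the call ID and the timing, individual tools stay free of tracing code and cannot forget to report a failure.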

The Outcome Event

At the end of a run, close the trace with an outcome event. This is where the system records whether the agent actually satisfied the original request.

json.SNIPPET
{
  "trace_id": "trc_20260603_8f42",
  "event_type": "outcome.completed",
  "status": "success",
  "summary": "Auth regression fixed, tests passed, draft PR prepared.",
  "artifacts": [ "diff_18a2", "test_run_77f0", "draft_pr_31" ],
  "matched_intent": true,
  "completed_at": "2026-06-03T15:51:44.912Z"
}

This final event is useful for product analytics as well as debugging. Over time, you can measure which intents succeed, which tools fail, which policies produce frequent denials, and where humans are repeatedly pulled into the loop.
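As a sketch of that kind of measurement, the function below computes, per classified intent, the share of runs whose outcome matched the intent. The `RunSummary` shape and `successRateByIntent` are hypothetical, the sample runs are invented for illustration, and a real pipeline would build the summaries by joining each trace envelope with its outcome event.

```typescript
interface RunSummary {
  intent: string; // Classified intent from the trace envelope.
  matched_intent: boolean; // From the outcome.completed event.
}

// Returns, for each classified intent, the fraction of runs that matched the intent.
function successRateByIntent(runs: RunSummary[]): Map<string, number> {
  const totals = new Map<string, { runs: number; matched: number }>();
  for (const run of runs) {
    const entry = totals.get(run.intent) ?? { runs: 0, matched: 0 };
    entry.runs += 1;
    if (run.matched_intent) entry.matched += 1;
    totals.set(run.intent, entry);
  }

  const rates = new Map<string, number>();
  totals.forEach((entry, intent) => rates.set(intent, entry.matched / entry.runs));
  return rates;
}

// Invented sample data; DOC_SUMMARY is a hypothetical second intent class.
const rates = successRateByIntent([
  { intent: "CODE_REVIEW_AND_CHANGESET", matched_intent: true },
  { intent: "CODE_REVIEW_AND_CHANGESET", matched_intent: true },
  { intent: "CODE_REVIEW_AND_CHANGESET", matched_intent: false },
  { intent: "CODE_REVIEW_AND_CHANGESET", matched_intent: true },
  { intent: "DOC_SUMMARY", matched_intent: true },
]);
rates.forEach((rate, intent) => console.log(intent, rate));
```

The same grouping pattern applies to tool failures by tool name or denials by policy_id.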

Conclusion

Agent observability is the discipline of making autonomy inspectable.

The core idea is simple: every meaningful agent action should be connected to the intent that caused it, the plan that justified it, the policy that governed it, and the outcome it produced. Once that chain exists, you can debug agents like real systems instead of reading scattered logs and guessing what the model meant to do.

Trust does not come from making agents sound confident. It comes from making their execution legible.
