TL;DR
Monitoring an agent is not the same as monitoring a prompt. Agentic systems make multi-step decisions, invoke tools, mutate external state, and reason over retrieved context that itself changes between runs. Five observability primitives - tool-call traces, action authorization audit, retrieval provenance, multi-step replay, and behavioural drift detection - are now the table-stakes layer for any production agent. OWASP LLM06 Excessive Agency and LLM07 Insecure Plugin Design are the canonical threat catalogue. NIST AI 600-1 (July 2024) adds the regulatory expectation. Updated 2026-05-20.
What changes when prompts become agents
A single-turn LLM prompt has a simple observability shape: input, output, latency, cost, model version, and a small set of safety filters. Most teams covered this layer adequately in 2023-2024 using lightweight LLM observability tools (LangSmith, Phoenix, Arize, Helicone), the OpenTelemetry GenAI semantic conventions, and basic prompt-and-completion logging.
An agentic system breaks that model. An agent decides, on each step, whether to think, search, call a tool, write to a database, send an email, or terminate. Each tool call can mutate external state. The next step is conditioned on what the previous step returned. The same goal, run twice, can produce two valid traces that touched different tools, retrieved different documents, and arrived at different end states.
The result is a system that looks at runtime more like a distributed workflow engine than a prompt API - and the observability discipline has to match. Five primitives now matter, each described in turn below.
The five agent observability primitives
The five primitives below are the minimum operational discipline for any production agent in a regulated environment. They map directly to OWASP LLM06 Excessive Agency, LLM07 Insecure Plugin Design, and the broader OWASP LLM Top 10 v2 (2025), and they are reinforced by NIST AI 600-1's agent-specific guidance in the Generative AI Profile.
Primitive 1: Tool-call traces
Every tool invocation an agent makes must be captured as a structured trace event, not as a free-text log line. The trace event includes the tool name, the version or schema hash, the input arguments (with sensitive values redacted), the response payload (or hash of it for large responses), the execution duration, the calling agent identity, the parent trace identifier, and the policy decision that authorised the call.
The reference point in 2026 is the OpenTelemetry GenAI semantic conventions, which were extended through 2024-2025 to cover GenAI-specific span attributes for chat interactions, embeddings, and tool calls. Most managed agent platforms now emit OTel-compatible traces natively. For self-hosted agents the lightweight pattern is to wrap each tool with a middleware decorator that starts a span before the tool body executes and ends it, with the result status and duration recorded, when the body returns or raises. The Areebi audit log stores this trace inline with the AI interaction so policy evaluation and tool calls are reconstructable from a single timeline.
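The decorator pattern can be sketched with the standard library alone. This is a minimal illustration, not the OpenTelemetry API: `traced_tool`, `TRACE_LOG`, `SENSITIVE_KEYS`, and the `lookup_customer` tool are hypothetical names, the in-memory list stands in for a durable trace store, and the policy decision is passed in as a fixed value rather than evaluated per call.

```python
import functools
import hashlib
import json
import time
import uuid

TRACE_LOG = []  # assumption: stand-in for a durable trace store
SENSITIVE_KEYS = {"password", "api_key", "ssn"}  # hypothetical redaction list


def _redact(args):
    """Replace sensitive argument values before they reach the trace store."""
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in args.items()}


def _digest(value):
    """Hash the response so large payloads are not stored inline."""
    blob = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def traced_tool(name, version, agent_id, policy_decision="allow"):
    """Wrap a tool so that every call appends one structured trace event."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*, parent_trace_id, **kwargs):
            event = {
                "span_id": uuid.uuid4().hex,
                "parent_trace_id": parent_trace_id,
                "tool": name,
                "tool_version": version,
                "agent_id": agent_id,
                "policy_decision": policy_decision,
                "arguments": _redact(kwargs),
            }
            start = time.perf_counter()
            try:
                result = fn(**kwargs)
                event["status"] = "ok"
                event["response_sha256"] = _digest(result)
                return result
            except Exception as exc:
                event["status"] = "error"
                event["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                event["duration_ms"] = (time.perf_counter() - start) * 1000.0
                TRACE_LOG.append(event)
        return wrapper
    return decorator


@traced_tool(name="lookup_customer", version="1.0.0", agent_id="support-agent")
def lookup_customer(customer_id, api_key):
    # Hypothetical tool body; a real tool would call an external system.
    return {"customer_id": customer_id, "tier": "gold"}
```

Calling `lookup_customer(parent_trace_id="trace-1", customer_id="c-42", api_key="secret")` returns the tool result and leaves one event in `TRACE_LOG` with the API key redacted and the response stored as a hash. In a real deployment the same wrapper would start and end an OTel span instead of appending to a list.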
Why this matters for OWASP LLM06: Excessive Agency typically manifests as an agent invoking a tool the operator did not expect it to invoke, often because the prompt was redirected by injection. Without a trace store, the after-the-fact investigation has nothing to investigate.
Primitive 2: Action authorization audit
Every consequential tool call must pass through a policy decision point that sits between the agent and the tool and evaluates the call against caller identity, agent identity, tool identifier, and resource, independently of the LLM. The decision, allow or deny, is recorded alongside the tool-call trace so that each action can later be tied to the rule that permitted it.
Why this matters for OWASP LLM06: authorization expressed only in the prompt is a suggestion, not access control. It must live in a layer the LLM cannot bypass.
Primitive 3: Retrieval provenance
Retrieval-augmented generation pipelines must record, for every chunk surfaced to the model, where the chunk came from and when. The provenance record includes the source document identifier, the document version, the chunk index, the embedding model version, the retrieval score, and the access control decision that allowed the chunk to be surfaced.
The practical pattern is to attach a provenance envelope to every retrieved chunk before it enters the prompt and to log the envelope alongside the trace. When a hallucination or data leak surfaces later, provenance lets the investigation answer two questions instantly: "Did the model invent this, or did it parrot a real source?" and "Should this user have been allowed to see this source?"
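One way to picture the provenance envelope is as a small immutable record built at retrieval time. The sketch below is illustrative only: `ProvenanceEnvelope`, `wrap_chunks`, the shape of the `raw_hits` dictionaries, and the `user_can_read` callback are assumptions, not part of any specific retrieval library.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEnvelope:
    source_document_id: str
    document_version: str
    chunk_index: int
    embedding_model_version: str
    retrieval_score: float
    access_decision: str  # "allow" or "deny"
    retrieved_at: str     # ISO 8601 timestamp (UTC)


def wrap_chunks(raw_hits, user_can_read, embedding_model_version):
    """Attach an envelope to every hit; only allowed chunks reach the prompt."""
    now = datetime.now(timezone.utc).isoformat()
    prompt_chunks, audit_records = [], []
    for hit in raw_hits:
        allowed = user_can_read(hit["doc_id"])
        envelope = ProvenanceEnvelope(
            source_document_id=hit["doc_id"],
            document_version=hit["doc_version"],
            chunk_index=hit["chunk_index"],
            embedding_model_version=embedding_model_version,
            retrieval_score=hit["score"],
            access_decision="allow" if allowed else "deny",
            retrieved_at=now,
        )
        # Log allow and deny alike so an investigation can see both.
        audit_records.append(asdict(envelope))
        if allowed:
            prompt_chunks.append({"text": hit["text"], "provenance": envelope})
    return prompt_chunks, audit_records
```

Logging denied chunks as well as allowed ones is a deliberate choice in this sketch: it lets a later review confirm that the access check ran, not only that nothing leaked.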
Why this matters: a large fraction of agent incidents in 2025 traced to retrieval misconfiguration - access controls that did not propagate from the source system to the embedding index, or indexes that captured historical content the source system had since restricted. Without provenance, the investigation is blind. Our healthcare GenAI operations post covers the retrieval governance angle in detail.
Primitive 4: Multi-step replay
An agent's behaviour is reproducible only if the full state at each step - prompt, model parameters, tool responses, retrieved context, policy decisions - is captured in a structured form. Without replay, an incident review has nothing to compare a misbehaviour against, and a red team cannot iterate on adversarial test cases.
The practical pattern is to assign a deterministic trace identifier per agent invocation, capture every span emission to a durable store, and provide a replay UI that reconstructs the conversation, tool calls, and retrievals in chronological order. For non-deterministic models, replay shows the inputs and the actual outputs side by side rather than re-running the model; for deterministic configurations (temperature 0, seeded, identical tool versions), replay can re-execute the full chain.
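The capture-and-reconstruct half of this pattern can be sketched as follows. It assumes the trace identifier is derived from the agent, session, and invocation sequence number, which is one possible reading of "deterministic"; `trace_id_for`, `SpanStore`, and `render_timeline` are hypothetical names, and the in-memory dictionary stands in for a durable store. The sketch replays recorded inputs and outputs only and does not re-execute the model.

```python
import uuid
from collections import defaultdict

# Arbitrary fixed namespace for this sketch (hypothetical value).
TRACE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def trace_id_for(agent_id, session_id, invocation_seq):
    """Same inputs always yield the same ID, so retries map to one trace."""
    name = f"{agent_id}/{session_id}/{invocation_seq}"
    return str(uuid.uuid5(TRACE_NAMESPACE, name))


class SpanStore:
    """In-memory stand-in for a durable span store."""

    def __init__(self):
        self._spans = defaultdict(list)

    def record(self, trace_id, step, kind, payload):
        self._spans[trace_id].append({"step": step, "kind": kind, "payload": payload})

    def replay(self, trace_id):
        """Return the recorded spans of one invocation in step order."""
        return sorted(self._spans.get(trace_id, []), key=lambda s: s["step"])


def render_timeline(spans):
    """Flatten spans into the lines a replay UI might display."""
    return [f"{s['step']:02d} {s['kind']}: {s['payload']}" for s in spans]
```

Spans may arrive out of order from concurrent tool calls, which is why `replay` sorts by an explicit step number rather than relying on insertion order.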
Why this matters for incident response: under DORA's major-incident reporting rules (Article 19 and its supporting technical standards) the timeline is 4 hours to initial notification and 72 hours to intermediate report. Replay is the difference between meeting that timeline and missing it.
Primitive 5: Behavioural drift detection
Agent behaviour drifts continuously as foundation models update, tool APIs change, retrieval corpora grow, and prompts evolve. Drift detection establishes a baseline of expected behaviour - response length distribution, tool-call frequency per intent, retrieval depth distribution, refusal rate, latency distribution, cost-per-task - and flags statistically significant shifts for review.
The practical pattern is to compute drift signals on rolling windows (24-hour, 7-day, 30-day) per agent, per intent class, per environment, and to alert when a signal exceeds a threshold. Drift is rarely a security incident by itself; it is a leading indicator that something underneath the agent has changed and should be re-validated. Drift detection is the agent equivalent of canary monitoring in distributed systems.
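As a rough illustration of a single drift signal, the sketch below compares the mean of a recent window against a baseline using a z-score. The choice of statistic, the threshold of 3.0, and the function names `drift_score` and `check_drift` are assumptions made for the example; a production monitor would more likely compare full distributions and run one check per agent, intent class, and environment.

```python
from statistics import mean, stdev


def drift_score(baseline, window):
    """Z-score of the window mean against the baseline distribution."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # A constant baseline: any change at all is treated as maximal drift.
        return 0.0 if mean(window) == mu else float("inf")
    standard_error = sigma / (len(window) ** 0.5)
    return (mean(window) - mu) / standard_error


def check_drift(baseline, window, threshold=3.0):
    """Flag the window when its mean sits too far from the baseline mean."""
    score = drift_score(baseline, window)
    return {"score": score, "alert": abs(score) > threshold}
```

For example, with a baseline of two to three tool calls per task, a window averaging six to seven calls per task raises an alert, while a window that matches the baseline does not.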
Why this matters: foundation model providers ship breaking changes to tool-use, structured-output, and safety APIs every few months. Without drift detection the first signal of a regression is a customer-visible incident. With drift detection it is a graph that goes red before the customer report lands.
Mapping the primitives to OWASP LLM Top 10
OWASP published the first LLM Top 10 in 2023 and the current 2025 edition in late 2024, and the numbering shifted between them. Excessive Agency is LLM06 in the 2025 edition (it was LLM08 in 2023). Insecure Plugin Design was LLM07 in the 2023 edition and is no longer a standalone entry in the 2025 edition. The risk labels used in this article draw on both editions, and the agentic-system risks behind them map directly onto the five primitives above.
| OWASP risk | Manifestation | Primary primitive | Supporting primitive |
|---|---|---|---|
| LLM06 Excessive Agency | Agent invokes a tool the operator did not authorise, often due to prompt injection | Action authorization audit | Tool-call traces, replay |
| LLM07 Insecure Plugin Design (2023 list) | Plugin trusts the agent's inputs without verifying upstream caller authorization | Action authorization audit | Tool-call traces |
| LLM01 Prompt Injection | Adversarial input redirects the agent's behaviour | Tool-call traces | Retrieval provenance, drift detection |
| LLM02 Sensitive Information Disclosure | Agent surfaces data the user should not see | Retrieval provenance | Action authorization, replay |
| LLM09 Overreliance (2023 list) | Operator trusts agent output without verification | Multi-step replay | Drift detection |
| LLM10 Model Theft (2023 list) | Adversary extracts model behaviour or parameters | Drift detection | Tool-call traces, replay |
The Areebi platform implements every primitive above as a first-class feature, which is why the OWASP threat mapping in our platform overview closes against specific control evidence rather than narrative claims.
What NIST AI 600-1 expects for agents
NIST AI 600-1 (July 2024), the Generative AI Profile of the AI Risk Management Framework, devotes significant attention to autonomous and agentic systems. The Profile names increased capabilities for chained or autonomous decisions as a top risk category and expects organisations to extend MAP, MEASURE, and MANAGE practices to cover agent-specific failure modes.
Three Profile expectations matter most for the observability stack. First, the organisation must MAP the full set of tools and integrations an agent can invoke, including conditional or dynamically discovered tools. Second, MEASURE practices must include red-team exercises that target the agent's tool-use surface and the retrieval pipelines, not just the prompt interface. Third, MANAGE practices must include incident response procedures that can reconstruct the agent's decision chain for any incident - which directly implies tool-call traces and replay.
For UK and EU readers, the NIST guidance aligns with the ICO and the European Data Protection Board guidance on automated decision-making, both of which require explainability and audit evidence for consequential decisions automated by AI. Our NIST AI RMF implementation guide walks through the broader Profile context.
What the 2026 agent observability stack looks like
A working stack today combines a small number of well-defined layers, each with a clear job. The composition below is what Areebi recommends to teams evaluating their first production agent deployment.
- Instrumentation: OpenTelemetry GenAI semantic conventions inside the agent runtime, emitting spans for prompts, completions, tool calls, retrievals, and policy decisions.
- Trace store: Either a managed LLM observability tool (LangSmith, Arize Phoenix, Helicone) or a general-purpose OTel-compatible tracing backend (self-hosted Jaeger or Tempo, or a hosted service such as Honeycomb). The Areebi audit log accepts OTel ingestion alongside platform-native interaction events so traces and audit timelines reconcile.
- Authorization layer: A policy decision point that wraps every consequential tool with a policy evaluation. Open-source primitives include OPA, Cedar, and the policy engines inside LLM gateways. The Areebi policy engine implements this pattern for AnythingLLM deployments.
- Retrieval governance: Provenance envelopes attached to every retrieved chunk plus access control propagation from source systems through the embedding index. Our shadow AI guide covers the surface area.
- Replay and review UI: A timeline view that reconstructs the agent's conversation, tool calls, and retrievals in chronological order, with overlay of policy decisions and provenance.
- Drift detection: Rolling-window statistical monitors over response length, tool-call frequency, retrieval depth, refusal rate, latency, and cost-per-task. Alerts feed the same incident channel as security events.
At Areebi, we built the platform around this stack because the alternative - bolting agent observability on after the fact, post-incident - is the most expensive way to learn the discipline. The AI Governance Assessment includes an agent observability scoring module against this stack.
Common pitfalls
Pitfall 1: Treating an agent like a prompt for monitoring purposes. Prompt observability tools capture input, output, and basic safety filtering. They do not capture tool-call traces, action authorisation decisions, retrieval provenance, or behavioural drift. Teams that adopted prompt observability in 2023 and have not upgraded their stack are operating agents blind to the failure modes that matter.
Pitfall 2: Trusting the model to enforce authorization. An agent prompt that says "only call the delete-customer tool if the user is an admin" is not access control - it is a suggestion. Authorization must live in a layer the LLM cannot bypass. The OWASP LLM06 catalogue exists because every team learns this the first time an injection redirects the agent.
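A minimal sketch of authorization that lives outside the model might look like this. The role-based `POLICY` table, the `ToolCall` fields, and the `authorize` and `guarded_invoke` helpers are hypothetical; a real deployment would delegate the decision to a policy engine such as OPA or Cedar. The point it illustrates is that the decision uses caller identity and a policy table, never the model's own output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    caller_id: str
    caller_roles: frozenset
    agent_id: str
    tool: str
    resource: str


# Hypothetical policy table: tool name -> roles allowed to trigger it.
POLICY = {
    "read-customer": frozenset({"support", "admin"}),
    "delete-customer": frozenset({"admin"}),
}

AUDIT_LOG = []  # assumption: stand-in for a durable audit store


def authorize(call):
    """Decide from identity and policy only; model output is never consulted."""
    allowed_roles = POLICY.get(call.tool, frozenset())  # unknown tool: deny
    decision = "allow" if call.caller_roles & allowed_roles else "deny"
    AUDIT_LOG.append({
        "decision": decision,
        "caller": call.caller_id,
        "agent": call.agent_id,
        "tool": call.tool,
        "resource": call.resource,
    })
    return decision


def guarded_invoke(call, tool_fn):
    """Run the tool only if the policy decision point allows the call."""
    if authorize(call) != "allow":
        raise PermissionError(f"{call.caller_id} may not call {call.tool}")
    return tool_fn(call.resource)
```

With this in place, an injected instruction that convinces the agent to request `delete-customer` on behalf of a support user still fails at `guarded_invoke`, and the denial is recorded in the audit log.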
Pitfall 3: Ignoring retrieval governance. Most agent incidents in 2025 traced to retrieval failures rather than prompt failures - chunks surfaced to a user who should not have seen them, or stale chunks surfaced after the source had been restricted. Retrieval provenance is unglamorous but it is where the largest production-incident class lives.
Pitfall 4: Static dashboards instead of drift alarms. Dashboards that humans visit on Mondays do not catch foundation-model regressions that arrive on Tuesdays. Drift detection has to alert in the same incident channel as security events, with the same on-call discipline.
What to read next
To go from agent observability concepts to production discipline, work through this cluster.
- Singapore agentic AI governance - the regulatory framing for agentic systems in APAC.
- Prompt injection prevention - the upstream control layer that complements agent monitoring.
- AI red teaming guide - the offensive discipline that validates the observability stack.
- LLM attack vectors 2026 - the broader threat landscape that includes agent-specific attacks.
- NIST AI RMF implementation guide - the framework context that wraps agent observability inside enterprise risk management.
Frequently Asked Questions
How is AI agent monitoring different from LLM prompt monitoring?
Prompt monitoring captures input, output, latency, cost, model version, and basic safety filters - a flat, single-turn shape. Agent monitoring must additionally capture tool-call traces (every tool the agent invoked and with what arguments), action authorization decisions (whether each consequential call was permitted), retrieval provenance (where each retrieved chunk came from and whether the user was allowed to see it), multi-step replay (the full conversational and tool-call timeline for a session), and behavioural drift signals over time. The shapes are different because agents make decisions and mutate state, where prompts only return text.
What does OWASP LLM06 Excessive Agency actually mean for agents?
LLM06 Excessive Agency describes the risk that an agent has the ability to take actions in the world (call tools, mutate databases, send emails, transfer funds, escalate privileges) that exceed what the operator intended, typically because the prompt was redirected by injection, the toolset was over-permissioned, or the authorisation layer trusted the LLM's judgement rather than enforcing rules externally. The mitigation is a policy decision point sitting between the agent and every consequential tool, evaluating each call against caller identity, agent identity, tool identifier, and resource - independently of the LLM.
What does OpenTelemetry's GenAI semantic convention add to agent observability?
OpenTelemetry GenAI semantic conventions extend the OTel trace specification with span attributes specific to GenAI workloads: model identifier, prompt and completion token counts, tool invocation metadata, embedding model identifier, and conversation context. Most managed agent platforms now emit OTel-compatible traces natively. Adopting OTel GenAI semconv means an organisation's existing distributed tracing infrastructure (Jaeger, Tempo, Honeycomb, Datadog APM) can ingest and query agent telemetry alongside service-level traces, which is the single largest operational win in the 2024-2025 standardisation cycle.
Where does retrieval provenance fit in?
Retrieval provenance is the structured record that travels with every chunk surfaced to the model from a retrieval-augmented pipeline. It captures the source document identifier, document version, chunk index, embedding model version, retrieval score, and the access control decision that allowed the chunk to be surfaced. When a hallucination or data-leak incident surfaces later, provenance lets the investigation answer two questions instantly: did the model invent this content or parrot a real source, and was the user actually allowed to see that source. Retrieval failures dominated the 2025 agent incident profile, which is why provenance has moved from nice-to-have to baseline.
How do I detect agent behaviour drift?
Establish a baseline over the first 30 days of agent operation across six signals - response length distribution, tool-call frequency per intent class, retrieval depth distribution, refusal rate, latency distribution, and cost-per-task. Then compute rolling-window statistics (24-hour, 7-day, 30-day) and alert when any signal exceeds a defined threshold. Drift is rarely a security incident by itself, but it is a leading indicator that something underneath the agent (the foundation model, a tool API, the retrieval corpus, or a prompt template) has changed and should be re-validated. Drift detection is the agent equivalent of canary monitoring in distributed systems.
What does NIST AI 600-1 say about agents specifically?
NIST AI 600-1 (the July 2024 Generative AI Profile of the AI RMF) names increased capabilities for chained or autonomous decisions as a top GenAI risk category. The Profile expects organisations to MAP the full set of tools and integrations an agent can invoke, MEASURE the agent's behaviour through red-team exercises that target tool-use and retrieval pipelines (not just the prompt interface), and MANAGE incident response procedures that can reconstruct the agent's decision chain for any incident. The last expectation is the regulatory pull behind tool-call traces and replay becoming non-negotiable.
About the Author
Areebi Research
The Areebi research team combines hands-on enterprise security work with deep AI governance research. Our analysis is informed by primary sources (NIST, ISO, OECD, federal registers, IAPP) and the operational realities of CISOs running AI programs in regulated industries today.