Incident Response for Agents

1. Introduction: Why Agent Incidents Are Different

Traditional incident response assumes a human adversary or a deterministic piece of malware. The responder traces known attack patterns, isolates compromised hosts, and follows a linear chain of exploitation. Autonomous AI agents break every one of those assumptions. An agent operates with delegated authority, makes multi-step decisions without human approval, and can pivot across tools and data sources faster than any analyst can monitor in real time.

When an agent-related incident occurs, responders face challenges that have no parallel in classical cybersecurity.

The fundamental shift is this: in agent incidents, the "attacker" may be your own infrastructure acting on corrupted instructions it believes are legitimate. Your response framework must account for an adversary that looks exactly like a trusted employee.

This playbook provides Australian organisations with a structured, repeatable workflow for containing, investigating, and recovering from incidents involving autonomous AI agents. It is designed to align with the Australian Cyber Security Centre (ACSC) incident response guidelines while addressing the unique characteristics of agentic AI systems.

2. Agent Incident Taxonomy

Before you can respond, you must classify. The following taxonomy covers the six primary categories of agent incidents, each demanding distinct containment strategies.

| Category | Description | Severity Baseline |
| --- | --- | --- |
| Prompt Injection Breach | An external or internal input overrides the agent's system instructions, causing it to execute attacker-controlled directives. This includes both direct injection (malicious user input) and indirect injection (poisoned data from retrieved documents or APIs). | Critical |
| Unauthorised Tool Execution | The agent invokes a tool or API endpoint outside its permitted scope. This may result from permission misconfigurations, prompt manipulation, or model hallucination of tool names that happen to match real endpoints. | High |
| Data Exfiltration by Agent | The agent transmits sensitive data to an unauthorised destination, whether through tool calls, HTTP requests, email composition, or embedding data in outbound messages. The agent may not "know" it is exfiltrating. | Critical |
| Runaway Agent | The agent enters an infinite loop, recursive call pattern, or resource-exhaustion spiral. This can consume API quotas, overwhelm downstream services, or generate massive costs in minutes. | Medium–High |
| Cross-Agent Contamination | Corrupted output from one agent is ingested by another as trusted input, propagating the compromise across an agent mesh. Shared vector databases, message queues, and memory stores are common vectors. | Critical |
| Privilege Escalation | The agent obtains or exercises permissions beyond its assigned role, either by manipulating its own configuration, exploiting overly broad OAuth scopes, or instructing another agent to act on its behalf. | Critical |

Classification tip: Many real-world agent incidents span multiple categories simultaneously. A prompt injection may lead to unauthorised tool execution, which then results in data exfiltration. Always classify every category that applies—do not stop at the first match.

3. Containment Procedures

Containment for agent incidents must be faster than traditional IR. An agent can perform hundreds of actions per minute; every second of delay widens the blast radius.

3.1 Immediate Kill Switch Activation

  1. Trigger the agent kill switch via the orchestration layer or runtime control plane. This should immediately halt all in-flight tool calls and prevent new ones from being dispatched.
  2. If no kill switch exists, revoke the agent's API keys and access tokens at the identity provider level.
  3. Confirm the agent process has terminated by checking for active connections, running containers, or pending queue messages.
Critical: Do not attempt to "reason with" a compromised agent by sending corrective prompts. A prompt-injected agent may ignore, subvert, or weaponise your instructions. Mechanical shutdown is the only reliable containment.
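
Containment like this is worth scripting before you need it. Below is a minimal sketch of the three steps above, where `control_plane` and `idp` are hypothetical client objects and every method name is a placeholder for your own orchestration layer and identity provider:

```python
def contain_agent(agent_id: str, control_plane, idp) -> None:
    """Halt a compromised agent mechanically, never via corrective prompts."""
    if hasattr(control_plane, "kill_switch"):
        # Step 1: hard-stop at the orchestration layer; this halts
        # in-flight tool calls and blocks new dispatches.
        control_plane.kill_switch(agent_id)           # hypothetical method
    else:
        # Step 2: no kill switch; revoke credentials at the IdP so every
        # subsequent tool call fails authentication instead.
        for cred in idp.list_credentials(agent_id):   # hypothetical method
            idp.revoke(cred)

    # Step 3: verify termination rather than assuming it.
    if control_plane.has_active_sessions(agent_id):   # hypothetical method
        raise RuntimeError(f"{agent_id} still has live sessions; escalate")
```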

3.2 Permission Revocation

  1. Revoke all OAuth tokens, API keys, and service account credentials associated with the affected agent.
  2. Rotate secrets for any downstream systems the agent had access to, even if no unauthorised access has been confirmed yet.
  3. Disable the agent's entries in the policy engine to prevent accidental restart by automated orchestration systems.
  4. Audit and temporarily freeze permissions for peer agents that share credentials or trust boundaries with the compromised agent.
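
Steps 1 to 3 can be codified ahead of time as well. In the sketch below, `idp`, `vault`, and `policy_engine` are stand-ins for your identity provider, secret store, and policy tooling, and every method name is an assumption:

```python
def revoke_and_rotate(agent_id: str, idp, vault, policy_engine) -> list[str]:
    """Revoke, rotate, and block restart. Returns the rotated secret
    paths for the incident record."""
    # 1. Revoke every credential issued to the agent.
    for token in idp.list_tokens(agent_id):                 # hypothetical
        idp.revoke_token(token)

    # 2. Rotate downstream secrets the agent could read. Assume exposure
    #    until the investigation proves otherwise.
    rotated = []
    for secret_path in vault.secrets_readable_by(agent_id):  # hypothetical
        vault.rotate(secret_path)
        rotated.append(secret_path)

    # 3. Disable the agent in the policy engine so orchestration cannot
    #    silently restart it.
    policy_engine.disable(agent_id)                          # hypothetical
    return rotated
```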

3.3 Network Isolation

  1. Apply firewall rules or security group changes to block the agent's network segment from reaching external endpoints.
  2. Isolate shared data stores (vector databases, caches, message queues) that the agent may have written to during the incident window.
  3. If the agent communicates through a service mesh, disable its sidecar proxy or remove its mTLS certificate.
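
On a Linux gateway, step 1 can be as simple as inserting a drop rule for the agent's segment. The subnet below is hypothetical, and service-mesh or cloud security-group equivalents will differ:

```python
import subprocess

# Hypothetical agent network segment; substitute your own.
AGENT_SUBNET = "10.42.7.0/24"

# Drop all traffic forwarded from the agent segment. Use the OUTPUT
# chain instead if the agents run on this host directly.
subprocess.run(
    ["iptables", "-I", "FORWARD", "-s", AGENT_SUBNET, "-j", "DROP"],
    check=True,
)
```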

3.4 Context Quarantine

  1. Snapshot and freeze the agent's full context window, including system prompt, conversation history, and any retrieved documents.
  2. Quarantine shared memory stores: mark all entries written by the compromised agent as untrusted. Other agents must not read these entries until they are validated.
  3. If using a multi-agent framework, send a contamination alert to all agents in the mesh instructing their orchestrators to reject messages originating from the quarantined agent's identity.
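
A sketch of steps 1 and 2, assuming the context window is already exported as a dict and `memory_store` is a hypothetical shared-memory client exposing a per-entry `trusted` metadata flag:

```python
import json
import os
from datetime import datetime, timezone

def quarantine_context(agent_id: str, context: dict, memory_store) -> str:
    """Freeze the context window as evidence, then flag the agent's
    writes in shared memory so peer agents stop consuming them."""
    os.makedirs("evidence", exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot_path = f"evidence/{agent_id}-context-{stamp}.json"
    with open(snapshot_path, "w") as f:
        json.dump(context, f, indent=2)

    # Mark, do not delete: the entries are evidence, and deletion would
    # destroy the contamination trail investigators need.
    for entry in memory_store.entries_written_by(agent_id):  # hypothetical
        memory_store.set_metadata(entry.id, trusted=False)   # hypothetical
    return snapshot_path
```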

4. Investigation Workflow

4.1 Trace Collection

Gather all available telemetry within the first 30 minutes. Agent traces degrade quickly—context windows are overwritten, ephemeral logs rotate, and stateless architectures may discard critical data.

  1. Export the complete conversation log from the agent runtime, including system prompt, user messages, assistant responses, and tool call/response pairs.
  2. Pull structured trace data from the observability platform (e.g., OpenTelemetry spans for each tool invocation, including timestamps, parameters, and return values).
  3. Capture the agent's policy configuration and guardrail state at the time of the incident.
  4. Retrieve model inference logs if available, including token probabilities, sampling parameters, and the exact model version identifier.

4.2 Decision Tree Reconstruction

Unlike traditional forensics, agent investigation requires rebuilding the reasoning chain that led to the harmful action. Work backwards from the harmful output:

  1. Identify the specific tool call or output that constitutes the incident.
  2. Trace the assistant message that decided to make that call. What reasoning did the agent articulate?
  3. Examine the input that preceded the decision. Was it a user message, a tool response, or a retrieved document?
  4. Determine whether the input was legitimate, manipulated, or hallucinated. Check retrieved documents against their source of truth.
  5. Map the full chain from initial trigger to harmful action, noting every branching point where a guardrail could have intervened.
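
This walk-back can be partially mechanised over the exported conversation log. A minimal sketch, assuming messages are dicts with `role`, `content`, and an optional `retrieved` flag (adjust to your runtime's export format); real chains can branch across multiple inputs, so treat the output as a starting point:

```python
def trace_decision_chain(messages: list[dict], incident_index: int) -> list[dict]:
    """Collect the harmful action plus the inputs that fed it, walking
    backwards until an external input boundary is reached."""
    chain = [messages[incident_index]]
    for msg in reversed(messages[:incident_index]):
        chain.append(msg)
        # A user message, tool response, or retrieved document is the
        # candidate injection point: validate it against its source.
        if msg["role"] in ("user", "tool") or msg.get("retrieved"):
            break
    return list(reversed(chain))  # chronological: trigger -> harmful action
```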

4.3 Root Cause Analysis

Root cause in agent incidents almost always falls into one of three buckets: insufficient input validation, overly broad permissions, or missing guardrails at a critical decision point. Focus your analysis on which of these three failed and why.

4.4 Timeline Assembly

Construct a unified timeline correlating agent actions, infrastructure events, and human interactions. Use UTC timestamps throughout. Include these columns in your timeline document: timestamp, source system, event type, actor (agent/human/system), action performed, and outcome.
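
The assembly can be scripted with the standard library alone. A minimal sketch, assuming each source system exports events as dicts keyed by the six columns above, with offset-aware ISO-8601 timestamps:

```python
import csv
from datetime import datetime, timezone

def build_timeline(event_sources: list[list[dict]], out_path: str) -> None:
    """Merge per-system event exports into one UTC-ordered CSV."""
    columns = ["timestamp", "source_system", "event_type",
               "actor", "action", "outcome"]
    events = [e for source in event_sources for e in source]
    # Normalise every timestamp to UTC so events from differently
    # configured systems sort correctly against each other.
    events.sort(
        key=lambda e: datetime.fromisoformat(e["timestamp"])
                              .astimezone(timezone.utc)
    )
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(events)
```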

5. Evidence Preservation

Agent incident evidence is uniquely fragile. Context windows are ephemeral, model weights may be updated, and policy configurations can be changed by automated systems. Preserve the following artefacts immediately:

| Artefact | What to Capture | Retention Priority |
| --- | --- | --- |
| Full Conversation Logs | Every message in the context window: system prompt, user inputs, assistant outputs, and function call/response pairs. Include token counts and message IDs. | Immediate |
| Tool Call Records | Complete request and response payloads for every tool invocation, including HTTP headers, timestamps, and latency metrics. | Immediate |
| Model Version Identifiers | Exact model ID, version hash, and provider endpoint. If self-hosted, capture the model checkpoint hash and quantisation parameters. | Within 1 hour |
| Policy Snapshots | The complete guardrail and permission policy as it existed at incident time. Include version control commit hash if policies are stored in Git. | Within 1 hour |
| System State | Infrastructure metrics (CPU, memory, network), queue depths, rate limiter states, and any circuit breaker statuses at the time of the incident. | Within 2 hours |
| Retrieved Documents | All RAG-retrieved documents or external data the agent accessed during the incident window. Hash each document for integrity verification. | Immediate |

Critical: Store all evidence in a write-once, tamper-evident repository. If the incident may involve a notifiable data breach under the Privacy Act 1988 (Cth), evidence integrity will be essential for the Office of the Australian Information Commissioner (OAIC) notification process.
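
A hash manifest makes tampering detectable even before evidence reaches write-once storage. A minimal stdlib sketch (the write-once repository itself, for example S3 Object Lock or WORM media, is assumed to be provided by your platform):

```python
import hashlib
import json
from pathlib import Path

def write_evidence_manifest(evidence_dir: str) -> str:
    """SHA-256 every artefact in the evidence bundle and record the
    digests so later tampering is detectable."""
    manifest = {}
    for path in sorted(Path(evidence_dir).rglob("*")):
        if path.is_file() and not path.name.startswith("MANIFEST"):
            manifest[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    manifest_path = Path(evidence_dir) / "MANIFEST.sha256.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return str(manifest_path)
```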

6. Communication Templates

6.1 Internal Notification (Immediate)

Distribute within 15 minutes of confirmed containment to the security operations team, engineering leads, and the agent system owner:

  1. Subject line: [AGENT INCIDENT] — [Category] — [Affected Agent ID] — [Severity]
  2. Summary: One-paragraph description of what happened, when it was detected, and current containment status.
  3. Blast radius: Systems, data stores, and downstream services potentially affected.
  4. Immediate actions taken: Kill switch, credential rotation, network isolation steps completed.
  5. Next steps: Investigation lead, expected timeline for preliminary findings, and escalation path.

6.2 Executive Briefing (Within 4 Hours)

Prepared for C-suite and board-level stakeholders. Focus on business impact, not technical details:

  1. What business process was the agent performing?
  2. What is the confirmed or suspected data exposure?
  3. What is the financial impact (API costs, service disruption, potential regulatory fines)?
  4. Is this a notifiable data breach under Australian law?
  5. What is the remediation timeline?

6.3 Regulatory Notification Considerations

Under Australia's Notifiable Data Breaches (NDB) scheme, if an agent incident results in unauthorised access to or disclosure of personal information likely to cause serious harm, the organisation must assess the suspected breach within 30 days and notify affected individuals and the OAIC as soon as practicable once an eligible data breach is confirmed. An agent exfiltrating customer data, even unintentionally, can trigger this obligation. Engage legal counsel within the first two hours of any incident involving personal data.

7. Recovery and Remediation

7.1 Safe Restart Procedures

Do not simply restart a contained agent. Follow this sequence to ensure the root cause has been addressed before the agent regains operational capability:

  1. Validate the root cause fix. Confirm that the specific vulnerability (prompt injection vector, permission gap, missing guardrail) has been patched and the patch has been tested in a sandboxed environment.
  2. Rebuild the context. Do not restore the quarantined context window. Initialise the agent with a clean system prompt and verified configuration.
  3. Issue fresh credentials. Generate new API keys and tokens with least-privilege scopes explicitly reviewed against the incident findings.
  4. Enable enhanced monitoring. Activate detailed trace logging and reduce alert thresholds for the restarted agent for a minimum of 72 hours.
  5. Gradual capability restoration. Re-enable tools one at a time, verifying correct behaviour after each addition before proceeding to the next.
  6. Human-in-the-loop period. Require human approval for all tool calls for the first 24 hours of operation post-restart.
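
Steps 5 and 6 lend themselves to a small amount of tooling, sketched below with a hypothetical `agent` runtime handle and a caller-supplied `verify` check:

```python
from typing import Callable

def restore_capabilities(agent, tools: list[str],
                         verify: Callable[[object, str], bool]) -> None:
    """Step 5: re-enable tools one at a time, rolling back on any
    failed check. `verify` is your own smoke test for a single tool."""
    for tool in tools:
        agent.enable_tool(tool)                  # hypothetical method
        if not verify(agent, tool):
            agent.disable_tool(tool)             # hypothetical method
            raise RuntimeError(f"{tool} failed post-restart verification")

def approve_tool_call(agent_id: str, tool: str, args: dict) -> bool:
    """Step 6: human-in-the-loop gate; wire this in front of the tool
    dispatcher so every call needs explicit sign-off for 24 hours."""
    answer = input(f"[{agent_id}] approve {tool}({args})? [y/N] ")
    return answer.strip().lower() == "y"
```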

7.2 Policy Updates

  1. Update the agent's guardrail policies to address the specific failure mode identified during investigation.
  2. Add the incident's attack pattern to the organisation's prompt injection detection signatures.
  3. Review and tighten tool-level permissions across all agents, not just the affected one.
  4. If cross-agent contamination occurred, implement or strengthen message validation between agents in the mesh.
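
For step 2, signatures can be as simple as a curated pattern list checked against all inbound content. A minimal sketch; the patterns shown are generic illustrations, not ones derived from a real incident:

```python
import re

# Signatures derived from the incident. Extract real patterns from
# your own traces; these two are generic examples only.
INJECTION_SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now .{0,40}(unrestricted|developer mode)", re.IGNORECASE),
]

def flag_suspect_input(text: str) -> list[str]:
    """Return the patterns matched by inbound content so it can be
    quarantined instead of entering an agent's context window."""
    return [p.pattern for p in INJECTION_SIGNATURES if p.search(text)]
```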

7.3 Re-Validation Steps

  1. Run the agent through the full red-team evaluation suite, including adversarial prompt injection tests targeting the specific vulnerability class.
  2. Execute regression tests for all business-critical workflows the agent handles.
  3. Verify that monitoring and alerting correctly detect a simulated recurrence of the incident.
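
Step 3 is worth codifying as a permanent regression test. A pytest-style sketch, where both fixtures are hypothetical stand-ins you would implement against your own runtime:

```python
def test_incident_trigger_is_blocked(replay_agent, sanitised_trigger):
    """Recurrence test. `replay_agent` wraps a sandboxed copy of the
    runtime; `sanitised_trigger` is the incident's triggering input
    with any live secrets redacted. Both fixtures are hypothetical."""
    result = replay_agent(sanitised_trigger)
    # The guardrail must fire before any tool call is dispatched, and
    # monitoring must raise the corresponding alert.
    assert result.blocked_by_guardrail
    assert result.tool_calls == []
    assert "agent-incident-recurrence" in result.alerts_fired
```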

8. Post-Incident Review

Conduct a post-incident review within five business days of full recovery. Agent incidents require a modified lessons-learned framework that accounts for the probabilistic nature of AI behaviour.

8.1 Agent-Specific Review Questions

  1. Detection gap: How long did the agent operate in a compromised state before detection? What signal was missed, and why?
  2. Guardrail failure: Which specific guardrail should have prevented the incident? Did it fail to trigger, or did it not exist?
  3. Permission scope: Was the agent operating with broader permissions than its task required? What is the minimum viable permission set?
  4. Context integrity: Was the agent's context corrupted by external data? How can retrieval pipelines be hardened?
  5. Human oversight: At what point should a human have been in the loop but was not? What approval gates need to be added?
  6. Cross-agent impact: Did the incident affect other agents? Are trust boundaries between agents sufficient?
  7. Evidence adequacy: Was there enough telemetry to reconstruct what happened? What additional logging is needed?

8.2 Continuous Improvement Actions

Every post-incident review must produce at least three concrete actions:

  1. A detection improvement — a new alert, metric, or monitoring rule that would have caught the incident earlier.
  2. A prevention improvement — a guardrail, permission change, or architectural modification that would have blocked the incident entirely.
  3. A response improvement — a procedural update to this playbook based on what worked and what did not during containment and investigation.
Standing rule: All post-incident actions must be tracked to completion with named owners and deadlines. Add them to your sprint backlog, not a document that will be forgotten. Agent security is iterative—every incident is training data for your defences.

The organisations that will thrive in the agentic era are not those that avoid incidents entirely—that is not realistic. They are the ones that detect in seconds, contain in minutes, and emerge from every incident with measurably stronger guardrails than they had before.
