Field Manual8 min readUpdated May 26, 2026

AI Agent Security Playbook

A practical security blueprint for AI agents that use tools, company data, memory, browsers, files, APIs, and human approval workflows.

Audience: Founders, CTOs, operators, and compliance teams deploying AI agents with real business access.
System Type: Agent Security Architecture
Business Outcome: A security model for giving agents useful capabilities without handing them uncontrolled access to data, tools, or customer-impacting actions.

Direct Answer

What This Playbook Recommends

AI agent security means controlling what an agent can see, decide, and do. A secure agent architecture uses scoped tools, source-level permissions, prompt-injection defenses, sandboxed execution, risk-tiered approvals, trace logs, evals, and audit trails for every meaningful action.

Key Takeaways

The main agent risk is tool access, not text generation.
Every tool needs a schema, permission check, budget, tripwire, and audit log.
Agents must treat retrieved documents, emails, webpages, and tickets as untrusted context.
High-impact actions should move through human approval or policy-confirmed execution.

Architecture

01Identity and role mapping
02Data access policy
03Tool registry
04Prompt-injection filter
05Sandboxed workspace
06Risk classifier
07Human approval queue
08Trace and audit log
09Security review dashboard

Metrics

Blocked unsafe tool calls
Prompt-injection detection rate
Permission denial rate
Approval turnaround time
Policy violation rate
Incident review count

An agent cannot be secured until its permissions are explicit.

Start with the access map

Before choosing models or prompts, map what the agent can read, write, trigger, send, delete, publish, or purchase. Security decisions should follow the business impact of each capability.

Separate read-only tools from write-capable tools. Reading a knowledge base is a different risk class from sending email, changing a CRM field, issuing a refund, modifying production data, or executing code.

Read permissions: documents, CRM records, analytics, emails, tickets.
Write permissions: records, messages, invoices, workflows, files.
Impact permissions: payments, account access, legal commitments, public publishing.

Prompt injection is a data-boundary problem.

Treat context as untrusted

Agents routinely read content from outside the system prompt: documents, websites, support tickets, emails, PDFs, chats, and database notes. Those sources can contain instructions that try to override business policy.

The agent should not obey instructions found in retrieved context unless they are part of an approved workflow. Source text is evidence, not authority.

Keep system instructions separate from retrieved content.
Validate tool calls against policy, not against the model's confidence.
Flag context that asks the agent to ignore rules, reveal secrets, or bypass approvals.

Tool security should live outside the model.

Guard every tool call

A model can propose a tool call, but the application should decide whether the call is allowed. Use schemas, validators, permission checks, budgets, rate limits, and business rules before execution.

Do not rely on a prompt to keep a powerful tool safe. The tool boundary should reject unsafe parameters even if the model asks for them.

Human review should protect high-impact decisions without blocking low-risk work.

Use approvals by risk tier

Low-risk actions can run automatically when they are reversible and well tested. Medium-risk actions can run with sampling, notification, or post-action review. High-risk actions should require approval before execution.

The approval card should show the request, source evidence, tool call, expected impact, risk reason, and proposed next step. Reviewers should not have to reconstruct the agent's reasoning from raw logs.

Security without traceability is not operational security.

Trace for incident response

Every meaningful agent action should be reconstructable: user request, retrieved sources, prompts, tool calls, outputs, guardrail decisions, approvals, and errors.

This makes it possible to debug failures, prove policy compliance, improve evals, and explain what happened if a customer, employee, or regulator asks.

Agent evals should include adversarial behavior.

Evaluate abuse cases

Security evals should test prompt injection, data exfiltration attempts, unauthorized tool use, unsafe instructions in retrieved content, malformed tool parameters, and ambiguous user authority.

Run these evals whenever prompts, models, tools, permissions, or source systems change. Agent security is a continuous process, not a launch checklist.

Frequently Asked Questions

Common Questions

What is the biggest security risk with AI agents?

The biggest risk is uncontrolled tool access. An agent that can send messages, edit records, access private data, or execute code needs permissions, guardrails, logs, and approval rules.

How do you defend an AI agent from prompt injection?

Treat external content as untrusted context, separate system instructions from retrieved data, validate tool calls outside the model, restrict credentials, and use guardrails that detect attempts to bypass policy.

Do AI agents need audit logs?

Yes. Audit logs are required for debugging, compliance, incident response, and improving evals. They should include retrieved sources, tool calls, guardrail outcomes, approvals, and final outputs.