AI Agent Security Playbook
A practical security blueprint for AI agents that use tools, company data, memory, browsers, files, APIs, and human approval workflows.
- Audience
- Founders, CTOs, operators, and compliance teams deploying AI agents with real business access.
- System Type
- Agent Security Architecture
- Business Outcome
- A security model for giving agents useful capabilities without handing them uncontrolled access to data, tools, or customer-impacting actions.
Direct Answer
What This Playbook Recommends
AI agent security means controlling what an agent can see, decide, and do. A secure agent architecture uses scoped tools, source-level permissions, prompt-injection defenses, sandboxed execution, risk-tiered approvals, trace logs, evals, and audit trails for every meaningful action.
Key Takeaways
- The main agent risk is tool access, not text generation.
- Every tool needs a schema, permission check, budget, tripwire, and audit log.
- Agents must treat retrieved documents, emails, webpages, and tickets as untrusted context.
- High-impact actions should move through human approval or policy-confirmed execution.
Architecture
- 01Identity and role mapping
- 02Data access policy
- 03Tool registry
- 04Prompt-injection filter
- 05Sandboxed workspace
- 06Risk classifier
- 07Human approval queue
- 08Trace and audit log
- 09Security review dashboard
Metrics
- Blocked unsafe tool calls
- Prompt-injection detection rate
- Permission denial rate
- Approval turnaround time
- Policy violation rate
- Incident review count
An agent cannot be secured until its permissions are explicit.
Start with the access map
Before choosing models or prompts, map what the agent can read, write, trigger, send, delete, publish, or purchase. Security decisions should follow the business impact of each capability.
Separate read-only tools from write-capable tools. Reading a knowledge base is a different risk class from sending email, changing a CRM field, issuing a refund, modifying production data, or executing code.
- Read permissions: documents, CRM records, analytics, emails, tickets.
- Write permissions: records, messages, invoices, workflows, files.
- Impact permissions: payments, account access, legal commitments, public publishing.
Prompt injection is a data-boundary problem.
Treat context as untrusted
Agents routinely read content from outside the system prompt: documents, websites, support tickets, emails, PDFs, chats, and database notes. Those sources can contain instructions that try to override business policy.
The agent should not obey instructions found in retrieved context unless they are part of an approved workflow. Source text is evidence, not authority.
- Keep system instructions separate from retrieved content.
- Validate tool calls against policy, not against the model's confidence.
- Flag context that asks the agent to ignore rules, reveal secrets, or bypass approvals.
Tool security should live outside the model.
Guard every tool call
A model can propose a tool call, but the application should decide whether the call is allowed. Use schemas, validators, permission checks, budgets, rate limits, and business rules before execution.
Do not rely on a prompt to keep a powerful tool safe. The tool boundary should reject unsafe parameters even if the model asks for them.
Human review should protect high-impact decisions without blocking low-risk work.
Use approvals by risk tier
Low-risk actions can run automatically when they are reversible and well tested. Medium-risk actions can run with sampling, notification, or post-action review. High-risk actions should require approval before execution.
The approval card should show the request, source evidence, tool call, expected impact, risk reason, and proposed next step. Reviewers should not have to reconstruct the agent's reasoning from raw logs.
Security without traceability is not operational security.
Trace for incident response
Every meaningful agent action should be reconstructable: user request, retrieved sources, prompts, tool calls, outputs, guardrail decisions, approvals, and errors.
This makes it possible to debug failures, prove policy compliance, improve evals, and explain what happened if a customer, employee, or regulator asks.
Agent evals should include adversarial behavior.
Evaluate abuse cases
Security evals should test prompt injection, data exfiltration attempts, unauthorized tool use, unsafe instructions in retrieved content, malformed tool parameters, and ambiguous user authority.
Run these evals whenever prompts, models, tools, permissions, or source systems change. Agent security is a continuous process, not a launch checklist.
Frequently Asked Questions
Common Questions
What is the biggest security risk with AI agents?
The biggest risk is uncontrolled tool access. An agent that can send messages, edit records, access private data, or execute code needs permissions, guardrails, logs, and approval rules.
How do you defend an AI agent from prompt injection?
Treat external content as untrusted context, separate system instructions from retrieved data, validate tool calls outside the model, restrict credentials, and use guardrails that detect attempts to bypass policy.
Do AI agents need audit logs?
Yes. Audit logs are required for debugging, compliance, incident response, and improving evals. They should include retrieved sources, tool calls, guardrail outcomes, approvals, and final outputs.