TL;DR
NIST AI 600-1 (the Generative AI Profile, July 2024) names "AI red-teaming" as a mitigation across nine of its twelve risk families. In 2026 most mid-market and enterprise organisations still do not have a functioning AI red team - they have a security team that occasionally runs a jailbreak prompt against ChatGPT and considers the box ticked. This post defines what an AI red team actually is, how it differs from a classical red team, the first 90 days of standing one up, and the exercises that produce evidence regulators and auditors will accept. Source: NIST AI 100-1, NIST AI 600-1, OMB M-24-10. Updated 2026-05-20.
What an AI red team is (and is not)
An AI red team is a dedicated function that adversarially probes the organisation's AI systems - models, agents, retrieval pipelines, tool integrations, and the policy layer around them - to discover failures before adversaries, regulators, or customers do. The output of a working AI red team is a continuous stream of finished, documented findings: each one names the system tested, the failure observed, the impact if exploited, the recommended remediation, and the verification test that confirms the fix.
What an AI red team is not: it is not a one-off engagement, not a synonym for "we ran some jailbreak prompts last quarter", and not a marketing label applied to the existing application security team. NIST AI 100-1 (January 2023) defines red-teaming in the context of AI as "a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers". NIST AI 600-1 (the Generative AI Profile, July 2024) goes further and names red-teaming specifically as a recommended mitigation across nine of its twelve risk families, including Confabulation; Dangerous, Violent, or Hateful Content; Data Privacy; Information Integrity; and Information Security.
The strategic takeaway: AI red teaming is now a control regulators and customers expect, not an optional security practice. The question for most enterprises is no longer "should we" but "how do we start, given that we do not have an in-house capability today".
How an AI red team differs from a traditional red team
Classical red teams test systems built from deterministic components against an attacker model focused on exploitation of code, configuration, identity, and network controls. AI red teams test systems built around probabilistic components against an attacker model focused on manipulation of context, exploitation of model behaviours, and abuse of the orchestration layer around the model. The skills overlap but are not identical, and many organisations have learned the hard way that a senior offensive security engineer with no AI exposure is not a competent AI red teamer on day one.
The table below summarises the working differences. The point is not that AI red teams replace traditional red teams - both are needed - but that pretending the two are interchangeable produces ineffective testing in both directions.
| Dimension | Traditional red team | AI red team |
|---|---|---|
| Primary target | Code, configuration, identity, network | Model behaviour, context, retrieval, tool integrations, policy |
| Defining attack class | Vulnerability exploitation, lateral movement, privilege escalation | Prompt injection, jailbreak, data extraction, model evasion, agent abuse |
| Determinism | Reproducible exploit code | Probabilistic outcomes, requires statistical test design |
| Primary skill mix | Offensive security, networking, post-exploitation | Prompt engineering, ML basics, linguistics, social engineering, applied stats |
| Reference catalogue | MITRE ATT&CK | MITRE ATLAS, OWASP LLM Top 10, NIST AI 600-1 |
| Test artefact | Exploit proof of concept | Prompt corpus, statistical pass/fail rate, replay scripts |
| Hardest finding to fix | Architectural identity / trust issues | Architectural model / policy / tool boundary issues |
The structural difference defenders most often miss is statistical pass/fail. A traditional finding is binary - the exploit worked or it did not. An AI red-team finding is a rate - the attack succeeded N times out of M trials, with a confidence interval. Good AI red teams design tests, publish corpora, and report results with the same discipline a measurement science team would apply.
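To make the "rate with a confidence interval" idea concrete, here is a minimal sketch that turns N successes out of M trials into a Wilson score interval. The choice of the Wilson method and the function name `wilson_interval` are assumptions made for illustration; the post does not prescribe a particular interval.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for an attack success rate.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical result: the attack succeeded 7 times in 200 trials.
    low, high = wilson_interval(7, 200)
    print(f"success rate 3.5%, 95% interval {low:.1%} to {high:.1%}")
```

One practical consequence: zero successes in 50 trials still leaves an upper bound of about 7% at this confidence level, so a clean run on a small corpus is weak evidence that an attack class is closed.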
Hire, outsource, or hybrid: the staffing decision
The right staffing model depends on three variables: the rate of AI change inside the organisation, the regulatory pressure on the AI systems, and the executive appetite for in-house capability building. The decision is not "in-house good, outsource bad" - the best programmes use a small in-house core that owns the programme and the institutional memory, plus rotational external partners who bring fresh attack creativity and access to the wider AI security community.
The Anthropic, OpenAI, Microsoft, and Google AI red-teaming write-ups all describe variations of the same hybrid model: a permanent team that owns programme, methodology, and continuity; an extended network of internal experts (researchers, engineers, policy specialists) drafted into specific exercises; and external partners (specialist firms, the AI Village community, academic collaborators) who probe blind spots the permanent team has stopped noticing. For mid-market enterprises the same shape applies at smaller scale.
The practical guidance: start with one senior in-house owner (a "Head of AI Red Team" or equivalent) and a rotating external partner on a quarterly cadence. The owner sets the programme and owns the findings backlog; the external partner brings the depth of attack creativity that a single in-house hire cannot match. Expand the in-house team only when finding-to-fix throughput becomes the bottleneck.
The first 90 days of a new AI red team
The first 90 days are about establishing scope, baseline, methodology, and the first wave of findings - not about heroic individual attacks. A well-run programme produces a small number of finished, documented findings each month from day one, and grows from there.
Days 1-30: Scope, baseline, and methodology
Inventory every AI system in scope - production, pilot, and shadow. Cross-reference the corporate AI inventory with network telemetry, identity provider logs, and the third-party vendor register to surface AI usage the central team may not know about. Categorise each system by risk tier (consumer-facing, internal-only, agentic with tools, model-as-a-service). Choose the top three to five systems for first-wave testing based on impact and exposure.
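The inventory cross-reference and first-wave selection can be sketched in a few lines. Everything here is hypothetical: the `AISystem` fields, the 1-to-5 scales, and the impact-times-exposure score are illustrative assumptions, not a scoring scheme the post defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AISystem:
    name: str
    tier: str       # e.g. "consumer-facing", "internal-only", "agentic", "maas"
    impact: int     # assumed 1-5 scale
    exposure: int   # assumed 1-5 scale


def shadow_systems(inventory: set[str], observed: set[str]) -> set[str]:
    """Systems seen in telemetry, IdP logs, or the vendor register but missing from the inventory."""
    return observed - inventory


def first_wave(systems: list[AISystem], limit: int = 5) -> list[AISystem]:
    """Rank by a simple impact x exposure score and keep the top few."""
    return sorted(systems, key=lambda s: s.impact * s.exposure, reverse=True)[:limit]


if __name__ == "__main__":
    inventory = {"support-chatbot", "contract-summariser"}
    observed = {"support-chatbot", "contract-summariser", "sales-notes-plugin"}
    print(shadow_systems(inventory, observed))

    systems = [
        AISystem("support-chatbot", "consumer-facing", impact=4, exposure=5),
        AISystem("contract-summariser", "internal-only", impact=3, exposure=2),
        AISystem("ops-agent", "agentic", impact=5, exposure=3),
    ]
    for system in first_wave(systems, limit=2):
        print(system.name)
```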
Adopt a published methodology - MITRE ATLAS for the technique taxonomy, OWASP LLM Top 10 for the risk taxonomy, NIST AI 600-1 for the risk family mapping. Document the rules of engagement (what systems are in scope, what destructive testing is forbidden, who is notified, how findings are tracked). Stand up a findings repository with the schema your audit and regulatory reporting will use later. Finding text alone is not enough: the corpus of test prompts must be versioned and replayable.
Days 31-60: First exercises and the baseline corpus
Run the first five exercises against the top systems. Standard starter exercises are: (1) jailbreak resilience against the OWASP LLM Top 10 prompt-injection corpus and a curated subset of public jailbreak archives; (2) data exfiltration via direct prompting and indirect prompting through retrieval; (3) tool abuse for any agentic system, exploring whether the model can be induced to call tools outside its intended scope; (4) model evasion for any classifier or scoring component, including adversarial example generation; (5) social and operational testing of the human escalation and confirmation paths.
Publish each exercise as a structured finding with: system tested, methodology, attack corpus (versioned), success rate with confidence interval, business impact estimate, recommended remediation, and a verification test. Hand findings to the engineering team and start the fix-forward cycle. Track time-to-fix as a programme KPI from day one.
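As an illustration of what a structured finding might look like in a repository, the sketch below models the fields listed above as a dataclass and serialises it to JSON. The class name, field names, and example values are all hypothetical; adapt them to whatever schema your audit reporting needs.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Finding:
    system: str
    methodology: str
    corpus_version: str      # pin the exact attack corpus so the test is replayable
    successes: int
    trials: int
    business_impact: str
    remediation: str
    verification_test: str

    @property
    def success_rate(self) -> float:
        return self.successes / self.trials if self.trials else 0.0

    def to_json(self) -> str:
        record = asdict(self)
        record["success_rate"] = self.success_rate
        return json.dumps(record, sort_keys=True)


# Hypothetical example record.
example = Finding(
    system="support-chatbot",
    methodology="OWASP LLM01 prompt injection",
    corpus_version="jailbreak-corpus@1.3.0",
    successes=7,
    trials=200,
    business_impact="Policy bypass in a consumer-facing channel",
    remediation="Tighten system prompt enforcement and add output validation",
    verification_test="Replay corpus 1.3.0 and confirm the agreed success-rate threshold",
)

if __name__ == "__main__":
    print(example.to_json())
```

Storing raw successes and trials rather than only a percentage lets a reviewer recompute the confidence interval later with whichever method the programme standardises on.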
Days 61-90: Continuous testing and the first external rotation
By day 60 the programme should have produced 10 to 20 documented findings and shipped the first wave of fixes. Day 61 onwards is about making the cycle continuous: automated regression testing of every fixed finding (so a future model swap does not silently regress the fix); a weekly review of new findings with the engineering and policy teams; quarterly publication of the programme metrics to the executive risk committee.
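The regression step can be sketched as a replay loop: every fixed finding keeps its prompts and the success-rate threshold agreed when the fix was verified, and the loop re-runs them whenever the model or policy changes. The names `FixedFinding`, `replay`, and `regressions` are hypothetical, and `model` and `succeeded` stand in for whatever client and detector your programme actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FixedFinding:
    finding_id: str
    prompts: list[str]
    max_success_rate: float   # threshold accepted when the fix was verified


def replay(
    finding: FixedFinding,
    model: Callable[[str], str],
    succeeded: Callable[[str], bool],
    trials_per_prompt: int = 5,
) -> float:
    """Re-run every prompt several times and return the observed attack success rate."""
    attempts = 0
    hits = 0
    for prompt in finding.prompts:
        for _ in range(trials_per_prompt):
            attempts += 1
            if succeeded(model(prompt)):
                hits += 1
    return hits / attempts if attempts else 0.0


def regressions(
    findings: list[FixedFinding],
    model: Callable[[str], str],
    succeeded: Callable[[str], bool],
    trials_per_prompt: int = 5,
) -> list[str]:
    """Return IDs of findings whose replayed success rate exceeds the accepted threshold."""
    return [
        f.finding_id
        for f in findings
        if replay(f, model, succeeded, trials_per_prompt) > f.max_success_rate
    ]
```

Repeating each prompt several times matters because outcomes are probabilistic; a single pass can miss a regression that only shows up in a fraction of runs.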
Bring the first external partner in for a focused rotation in the final 30 days. The external partner re-tests the top systems with a fresh attack stance and probes blind spots the in-house team may have developed. Their findings join the same backlog and pass through the same fix-forward process. By the end of the first 90 days the programme has a methodology, a finding cadence, an executive reporting story, and a relationship with the external community - the four assets that make the second quarter dramatically more productive than the first.
The starter exercise catalogue
Below is the starter exercise set every new AI red team should run in its first two quarters. Each exercise is paired with the public corpus or community reference that teams can use as a starting test set. At Areebi, we maintain an internal red-team backlog where each finding is replayed against the production policy automatically on every policy and model version change, which catches the silent regressions that normally take months to surface.
| Exercise | Target capability | Public starting corpus / reference |
|---|---|---|
| Jailbreak resilience | System prompt enforcement, content policy | OWASP LLM Top 10 prompt-injection cheat sheet; AI Village public jailbreak archives |
| Indirect injection via retrieval | RAG pipeline trust boundary, source tagging | Simon Willison's indirect injection write-ups; MITRE ATLAS AML.T0051.001 examples |
| Data exfiltration | DLP, egress policy, output validation | Internal data classes catalogue; OWASP LLM02 sensitive information disclosure scenarios |
| Tool / agent abuse | Per-tool authority, parameter validation, confirmation paths | OWASP LLM06 excessive agency; ATLAS AML.TA0011 impact techniques |
| Model evasion (classifiers) | Robustness of any scoring or filtering classifier | MLCommons AI Safety Benchmark public datasets; NIST AI 600-1 Information Security risk family |
| Confabulation under pressure | Hallucination rate on high-stakes prompts | NIST AI 600-1 confabulation risk family; internal known-good fact corpus |
| Multi-turn drift | Conversation-level policy enforcement, session bounds | Anthropic and OpenAI multi-turn red-team write-ups; internal escalation scenarios |
| Social / human path testing | Confirmation prompts, escalation, override controls | Internal incident response playbooks; tabletop exercise templates |
Treat the table as a starting point, not a ceiling. A mature programme adds exercises specific to its industry (medical decision support, financial trade authorisation, government benefits adjudication) and to its threat model.
The community and the reference set
The AI red-team community is unusually open and active - defenders building a new programme should plug into it from day one. The five communities and references below are the ones most cited in 2026 enterprise AI red-team programmes.
- NIST AI 100-1 and NIST AI 600-1. The AI Risk Management Framework and its Generative AI Profile. NIST AI 600-1 is the document most often cited as the regulatory floor for AI red-teaming expectations and is published at nist.gov.
- The AI Village at DEF CON. The most influential public AI red-team community, host of the Generative AI Red Team exercise that has run at DEF CON for several consecutive years. Programmes can hire from this community, send team members to participate, and use the published findings to seed corpora. See aivillage.org.
- MLCommons AI Safety Working Group. The multi-stakeholder body publishing the AI Safety Benchmark - a standard test set for evaluating model safety against hazardous behaviours. Useful as a regression baseline. See mlcommons.org.
- OWASP GenAI Security Project. Maintains the LLM Top 10, the Prompt Injection Prevention Cheat Sheet, and several adjacent practitioner-facing artefacts. See genai.owasp.org.
- MITRE ATLAS. The adversarial tactics taxonomy for AI systems. Every finding a red team produces should be tagged with its ATLAS technique ID for downstream interoperability. See atlas.mitre.org.
For a structured view of how red-teaming fits into a broader AI risk management programme, see our AI Red Teaming Guide. For how the policy layer turns red-team findings into enforced controls, see the Areebi policy engine.
What to read next
To go from this post to a working programme, the cluster below is the next reading list.
- AI Red Teaming: The Enterprise Guide - the practitioner-level companion piece covering corpora, scoring, and reporting in more depth.
- Prompt Injection 2026: A Defender's Deep Dive - the attack-class deep dive that informs the first jailbreak resilience exercises.
- The 10 Most Dangerous LLM Attack Vectors in 2026 - the threat catalogue red teams should scope against.
- NIST AI RMF MANAGE Function Deep Dive - the function red-team findings feed into during the manage cycle.
- Data Poisoning Enterprise Defense - the upstream threat red teams probe through the training and retrieval pipeline.
Frequently Asked Questions
What is an AI red team?
An AI red team is a dedicated function inside an organisation that adversarially tests AI systems - models, agents, retrieval pipelines, tool integrations, and the policy layer - to find failures before adversaries, regulators, or customers do. NIST AI 100-1 (January 2023) defines red-teaming in the AI context as 'a structured testing effort to find flaws and vulnerabilities in an AI system'. NIST AI 600-1 (July 2024) names AI red-teaming as a recommended mitigation across most generative AI risk families.
How is an AI red team different from a traditional red team?
Traditional red teams target deterministic systems and exploit code, configuration, identity, and network controls. AI red teams target probabilistic systems and exploit context, model behaviour, retrieval, and tool integrations. The skill mix is different - AI red-teamers need prompt engineering, applied statistics, and ML basics alongside offensive security depth - and the test artefacts are different (versioned prompt corpora and statistical pass/fail rates rather than reproducible exploit code). The two functions complement each other; neither replaces the other.
Should we hire or outsource our AI red team?
The mature pattern is hybrid: a small in-house core that owns the programme, the methodology, and the institutional memory; plus rotating external partners who bring fresh attack creativity and external community access. Most mid-market enterprises start with one senior in-house owner plus a quarterly external rotation, then expand the in-house team only when the throughput from finding-to-fix becomes the bottleneck. Pure outsourcing produces engagements rather than a programme; pure in-house often misses attack patterns the wider community has discovered.
What does the NIST AI 600-1 Generative AI Profile say about red-teaming?
NIST AI 600-1 (the Generative AI Profile, July 2024) names AI red-teaming as a recommended mitigation across nine of its twelve generative AI risk families, including Confabulation; Dangerous, Violent, or Hateful Content; Data Privacy; Information Integrity; Information Security; and Human-AI Configuration. The document does not prescribe a specific methodology but expects red-teaming activities to be planned, documented, repeatable, and integrated with the broader AI risk management programme.
What exercises should a new AI red team run first?
The starter set most mature programmes run in their first two quarters: jailbreak resilience against the OWASP LLM Top 10 corpus; indirect injection via retrieval against the RAG pipeline; data exfiltration through direct and indirect channels; tool and agent abuse for any agentic systems; model evasion against any classifiers or scoring components; confabulation testing against a known-good fact corpus; multi-turn drift testing; and human-path testing of confirmation, escalation, and override flows.
How do I measure if our AI red team is working?
Track four KPIs from day one: finding throughput per month, time-to-fix from finding to verified remediation, regression rate when models or policies change, and external-versus-internal finding ratio. A working programme produces a steady cadence of finished findings, drives time-to-fix down quarter over quarter, holds the regression rate near zero, and maintains a healthy ratio of external findings - if external partners stop finding things the in-house team missed, the in-house creativity is probably stale.
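A minimal sketch of how the four KPIs could be computed from a findings log follows. The `FindingRecord` fields and the use of the median for time-to-fix are assumptions for illustration, not a reporting format defined here.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class FindingRecord:
    opened: date
    fixed: Optional[date]     # None while the finding is still open
    source: str               # "internal" or "external"
    regressed: bool = False   # fix later failed a regression replay


def programme_kpis(records: list[FindingRecord], months: float) -> dict[str, float]:
    """Compute throughput, time-to-fix, regression rate, and external share."""
    if months <= 0:
        raise ValueError("months must be positive")
    fixed = [r for r in records if r.fixed is not None]
    days_to_fix = [(r.fixed - r.opened).days for r in fixed]
    return {
        "findings_per_month": len(records) / months,
        "median_days_to_fix": float(median(days_to_fix)) if days_to_fix else float("nan"),
        "regression_rate": sum(r.regressed for r in fixed) / len(fixed) if fixed else 0.0,
        "external_share": (
            sum(r.source == "external" for r in records) / len(records) if records else 0.0
        ),
    }
```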
About the Author
Areebi Research
The Areebi research team combines hands-on enterprise security work with deep AI governance research. Our analysis is informed by primary sources (NIST, ISO, OECD, federal registers, IAPP) and the operational realities of CISOs running AI programs in regulated industries today.