Red-Teaming for Generative AI: Silver Bullet or Security Theater?
Michael Feffer, Anusha Sinha, Wesley Hanwen Deng, Zachary C. Lipton, Hoda Heidari
TL;DR
Generative AI red-teaming is widely promoted as a risk mitigation tool, but its definition, scope, and regulatory utility remain unclear. The authors combine case studies with a literature survey to map threat models, artifacts, lifecycles, and reporting practices, exposing substantial heterogeneity and raising concerns about security theater. They argue that red-teaming is valuable but not sufficient on its own, and they offer a structured question bank to standardize future evaluation practices. Their analysis of public NIST RFI comments aligns with these findings and underscores the need for clearer definitions, broader stakeholder involvement, and standardized reporting in governance.
Abstract
In response to rising concerns surrounding the safety, security, and trustworthiness of Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red-teaming as a key component of their strategies for identifying and mitigating these risks. However, despite AI red-teaming's central role in policy discussions and corporate messaging, significant questions remain about what precisely it means, what role it can play in regulation, and how it relates to conventional red-teaming practices as originally conceived in the field of cybersecurity. In this work, we identify recent cases of red-teaming activities in the AI industry and conduct an extensive survey of relevant research literature to characterize the scope, structure, and criteria for AI red-teaming practices. Our analysis reveals that prior methods and practices of AI red-teaming diverge along several axes, including the purpose of the activity (which is often vague), the artifact under evaluation, the setting in which the activity is conducted (e.g., actors, resources, and methods), and the resulting decisions it informs (e.g., reporting, disclosure, and mitigation). In light of our findings, we argue that while red-teaming may be a valuable big-tent idea for characterizing GenAI harm mitigations, and while industry may effectively apply red-teaming and other strategies behind closed doors to safeguard AI, gestures towards red-teaming (based on public definitions) as a panacea for every possible risk verge on security theater. To move toward a more robust toolbox of evaluations for generative AI, we synthesize our recommendations into a question bank meant to guide and scaffold future AI red-teaming practices.
