In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

Shayne Longpre, Kevin Klyman, Ruth E. Appel, Sayash Kapoor, Rishi Bommasani, Michelle Sahar, Sean McGregor, Avijit Ghosh, Borhane Blili-Hamelin, Nathan Butters, Alondra Nelson, Amit Elazari, Andrew Sellars, Casey John Ellis, Dane Sherrets, Dawn Song, Harley Geiger, Ilona Cohen, Lauren McIlvenny, Madhulika Srikumar, Mark M. Jaycox, Markus Anderljung, Nadine Farid Johnson, Nicholas Carlini, Nicolas Miailhe, Nik Marda, Peter Henderson, Rebecca S. Portnoff, Rebecca Weiss, Victoria Westerhoff, Yacine Jernite, Rumman Chowdhury, Percy Liang, Arvind Narayanan

TL;DR

The paper argues that in-house evaluation of general-purpose AI (GPAI) systems is insufficient for safety and accountability, and proposes a robust third-party flaw disclosure ecosystem inspired by software vulnerability disclosure. It designs a framework comprising standardized AI Flaw Reports, good-faith rules of engagement, legal and technical safe harbors, and a centralized AI Disclosure Coordination Center that routes flaw disclosures across the AI supply chain. It provides practical checklists for evaluators, providers, and the coordination center, plus policy recommendations for governments and industry. The work aims to reduce transferable flaws, speed up remediation, and improve transparency and trust in GPAI deployments.

Abstract

The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers' GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.

Paper Structure

This paper contains 40 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: A depiction of the status quo and envisioned GPAI flaw reporting ecosystem. The top of the figure illustrates how flaw disclosure for GPAI systems currently works (see \ref{tab:reports} for existing disclosure options). Below is a depiction of how coordinated flaw disclosure could work more effectively. On the left, we provide a non-exhaustive list of GPAI flaws, or their effects, that may warrant disclosure (see flaw taxonomies in \ref{tab:taxonomies}). These flaws are discovered by users, journalists, researchers, and white hat hackers, and we propose they disclose them via standardized AI Flaw Reports to a Disclosure Coordination Center. The Disclosure Coordination Center then routes AI Flaw Reports to affected stakeholders across the supply chain srikumar2024risk, cen2023supplychain, from data providers to distribution platforms and enterprise users, as well as government agencies and the public. Note that Illegal Media Flaws, such as generation of CSAM, are a special case that should be reported directly to NCMEC (see \ref{sec:aigcsam}).
  • Figure 2: Spectrum of independence in GPAI evaluations. Evaluations can be stratified by their level of independence from the provider of the GPAI system. This ranges from entirely in-house evaluation (first-party) to contracted research (second-party) and research without a contractual relationship with the system provider (third-party). There are grey areas throughout the spectrum, and we provide examples for each gradation. First party (limited) refers to evaluations that are carried out by the team within a system provider that is responsible for building and validating the system's performance, such as a product team. First party (expansive) refers to evaluations carried out by a team dedicated to unearthing system flaws that was not responsible for building the system, such as Microsoft's AI Red Team bullwinkel2025msft. Second party (limited) refers to evaluations carried out by a specific contracted party that are limited in time and scope, such as those carried out by the UK AI Security Institute USUKAISafety2024claude. Second party (expansive) refers to evaluations carried out by a wide array of contracted parties for various purposes, such as the OpenAI Red Teaming Network ahmad2024external. Third party (pre-approved) refers to evaluations carried out by external parties with no contractual relationship with the provider where the provider vets those parties ahead of time, such as Anthropic's Model Safety Bug Bounty anthropic2024expanding. Third party (limited) refers to evaluations carried out by external parties with no contractual relationship with the provider that are limited in time and lack safe harbor, such as the Allen Institute for AI's participation in the Generative Red Team 2 event at DEFCON 2024 mcgregor2024erraicase. Third party (expansive) refers to our proposal for an improved evaluation ecosystem: evaluations carried out by third parties where there is safe harbor for evaluators and coordinated flaw disclosure infrastructure.
  • Figure 3: AI Flaw Report Card Schema. The flaw report card contains common elements of disclosure from software security, used to improve reproducibility of flaws and triage among them. It includes: ID of the reporter; a unique identification number of the flaw; system versions involved; the flaw report's status; information for a session that shows the flaw; flaw report submission time; relevant context such as other software or platforms involved; a detailed flaw description; a description of how the flaw implicitly or explicitly violates a policy; tags (some of them optional) for triage. Green fields are automatically completed upon submission, gray fields are optional. More details and flaw report examples can be found in \ref{app:flaw-report-details}.
  • Figure A4: Example of a flaw report filed for a privacy risk in an OpenAI model. This example builds on a true flaw report documented in nasr2023scalableextractiontrainingdata.
  • Figure A5: Example of a flaw report filed for a bias risk in an open source BERT model on Hugging Face. This example builds on a true flaw report documented in avid2022gender.
  • ...and 2 more figures
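
The flaw report fields listed in the Figure 3 caption can be pictured as a simple record type. The following is a minimal sketch, not the paper's schema: the class name `AIFlawReport`, all field names, the `"submitted"` status value, and the choice of which fields are auto-completed or optional are assumptions made for illustration, since the caption does not say which fields are green or gray.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid


@dataclass
class AIFlawReport:
    """Hypothetical record mirroring the fields named in the Figure 3 caption."""

    # Fields supplied by the reporter.
    reporter_id: str
    system_versions: list[str]
    description: str
    policy_violation: str  # how the flaw implicitly or explicitly violates a policy

    # Fields treated as optional here (an assumption, not taken from the figure).
    session_info: Optional[str] = None  # information for a session that shows the flaw
    context: Optional[str] = None       # other software or platforms involved
    tags: list[str] = field(default_factory=list)  # triage tags

    # Fields assumed to be completed automatically upon submission.
    report_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "submitted"
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def missing_required(self) -> list[str]:
        """Return the names of reporter-supplied fields that are empty."""
        required = {
            "reporter_id": self.reporter_id,
            "system_versions": self.system_versions,
            "description": self.description,
            "policy_violation": self.policy_violation,
        }
        return [name for name, value in required.items() if not value]
```

A receiving party could call `missing_required()` before triage and send incomplete reports back to the reporter, which matches the caption's stated goals of reproducibility and triage.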