A Safe Harbor for AI Evaluation and Red Teaming

Shayne Longpre; Sayash Kapoor; Kevin Klyman; Ashwin Ramaswami; Rishi Bommasani; Borhane Blili-Hamelin; Yangsibo Huang; Aviya Skowron; Zheng-Xin Yong; Suhas Kotha; Yi Zeng; Weiyan Shi; Xianjun Yang; Reid Southen; Alexander Robey; Patrick Chao; Diyi Yang; Ruoxi Jia; Daniel Kang; Sandy Pentland; Arvind Narayanan; Percy Liang; Peter Henderson

TL;DR

The paper argues that independent evaluation and red teaming of deployed generative AI are hampered by restrictive terms of service and by researchers' fear of enforcement. It proposes two voluntary safe harbors: a legal safe harbor that shields good-faith research from civil liability, and a technical safe harbor that protects researchers from account suspensions. It also proposes supporting mechanisms such as trusted intermediaries (for example, NAIRR) and a transparent appeals process. Together, these proposals form a governance framework for expanding public-interest AI safety research while preserving safeguards against misuse. If adopted by major AI developers, the safe harbors could broaden participation, improve trust, and enhance accountability in the evaluation of high-risk AI systems.

Abstract

Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems. However, the terms of service and enforcement strategies that prominent AI companies use to deter model misuse also disincentivize good-faith safety evaluations. As a result, some researchers fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal. Although some companies offer researcher access programs, these are an inadequate substitute for independent research access, as they have limited community representation, receive inadequate funding, and lack independence from corporate incentives. We propose that major AI developers commit to providing a legal and technical safe harbor, indemnifying public-interest safety research and protecting it from the threat of account suspensions or legal reprisal. These proposals emerged from our collective experience conducting safety, privacy, and trustworthiness research on generative AI systems, where norms and incentives could be better aligned with public interests, without exacerbating model misuse. We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.

Paper Structure (19 sections, 1 figure, 4 tables)

This paper contains 19 sections, 1 figure, 4 tables.

Introduction
Background & Motivations
Avoiding the Fate of Social Media Platforms
Prominent social media platforms block researcher access to the detriment of public interests.
Conducting research on generative AI comes with additional challenges compared to social media.
The Importance of Independent AI Evaluation
Concerns over the risks and harms of generative AI are mounting.
Independent AI evaluation and red teaming are crucial for uncovering vulnerabilities, before they proliferate.
Challenges to Independent AI Evaluation
Safe Harbors
A Legal Safe Harbor
A Technical Safe Harbor
Related Proposals
Conclusion
Additional Considerations & Future Work
...and 4 more sections

Figures (1)

Figure 1: A summary of the suggested mutual commitments and scope of a legal safe harbor and a technical safe harbor. These commitments extend existing safe harbors for security research as well as researcher access programs, and are written in the context of US laws. For a wider list of common researcher responsibilities, see https://bugcrowd.com/openai.
