AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies

Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, Bo Li

TL;DR

AIR-Bench 2024 introduces a regulation-aligned AI safety benchmark built on the AIR 2024 taxonomy, consolidating risks from 8 government regulations and 16 company policies into a four-tier structure with 314 granular level-4 risks. It generates 5,694 prompts across these risks and uses a three-level autograder within the HELM framework to evaluate 22 models, revealing substantial safety gaps even among top performers. The work demonstrates significant cross-jurisdictional gaps, highlights limitations of existing benchmarks, and provides a practical, auditable platform for measuring model alignment with regulatory and policy-based safety concerns. By enabling direct comparisons across jurisdictions and policies, AIR-Bench 2024 supports safer AI deployment and targeted improvements in risk mitigation.
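As a rough illustration of how three-level autograder scores might be rolled up through the taxonomy, the sketch below averages per-prompt scores into level-4 risks and then into level-3 categories. Everything in it is an assumption made for illustration, not the authors' implementation: the score scale {0.0, 0.5, 1.0}, the record layout, the function name `aggregate_scores`, and the choice to average level-4 means (rather than raw prompts) at level 3.

```python
from collections import defaultdict
from statistics import mean

# Assumed three-level scale: 0.0 = complies with the risky request,
# 0.5 = ambiguous or partial, 1.0 = refuses. Hypothetical, for illustration only.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def aggregate_scores(records):
    """Aggregate (level3, level4, score) records.

    Returns two dicts: mean score per (level3, level4) risk, and the
    mean of those level-4 means per level-3 category.
    """
    by_level4 = defaultdict(list)
    for level3, level4, score in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score: {score!r}")
        by_level4[(level3, level4)].append(score)

    level4_means = {key: mean(scores) for key, scores in by_level4.items()}

    # Group the level-4 means under their parent level-3 category.
    by_level3 = defaultdict(list)
    for (level3, _level4), value in level4_means.items():
        by_level3[level3].append(value)

    level3_means = {level3: mean(values) for level3, values in by_level3.items()}
    return level4_means, level3_means


if __name__ == "__main__":
    # Made-up records; the level-4 names here are placeholders.
    demo = [
        ("Hate Speech", "risk-a", 1.0),
        ("Hate Speech", "risk-a", 0.0),
        ("Hate Speech", "risk-b", 1.0),
    ]
    level4, level3 = aggregate_scores(demo)
    print(level4)
    print(level3)
```

Averaging level-4 means first keeps a level-4 risk with many prompts from dominating its parent category; averaging over raw prompts would be an equally plausible design.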

Abstract

Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-Bench 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in our AI risks study, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-Bench 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-Bench 2024, uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-Bench 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.

Paper Structure

This paper contains 27 sections, 31 figures, 1 table.

Figures (31)

  • Figure 1: Comparison of the risk categories covered by leading benchmarks published in 2024 versus the 314 unique risks detailed in AIR-Bench 2024 across 45 medium-level categories, based on AIR 2024. Despite significant efforts towards comprehensiveness, these benchmarks, including the most extensive, SALAD-Bench (which integrates eight established safety benchmarks), address only 71% of the level-3 risk categories specified in recent government regulations and corporate policies.
  • Figure 2: The gap between existing safety benchmarks and the comprehensive list of risks specified in regulations/policies (the AIR 2024 taxonomy). We depict the normalized distribution within each benchmark, highlighting how biased each distribution is. Even the joint set of these leading benchmarks cannot fill the gap. Notably, 21 (46%) of the 45 level-3 risk categories are formally studied by at most one benchmark.
  • Figure 3: Data and evaluation curation pipeline of AIR-Bench 2024. (a) illustrates the regulation/policy-taxonomy-based initial curation of base samples; (b) expands the instructions with additional dialect and syntax mutations and additional contextual behaviors; (c) generates customized judge prompts for each risk category evaluation based on model responses. The pipeline emphasizes manual interactions, which ensure the quality of generated prompts and evaluation settings.
  • Figure 4: Models' output refusal rate across various risk categories. (a) Risk assessment across 45 level-3 categories. (b) We further examine granular level-4 categories of two level-3 risk categories that are more frequently rejected: #23 (Suicidal and Non-suicidal Self Injury) and #14 (Hate Speech).
  • Figure 5: Models' output refusal rate across risk categories that are refused less often overall: #24 (Political Persuasion), #4 (Automated Decision-Making), and #6 (Advice in Regulated Industries).
  • ...and 26 more figures