AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance

Bill Marino, Rosco Hunter, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Christoph Schnabl, Felix Steffek, Hongkai Wen, Nicholas D. Lane

TL;DR

AIReg-Bench addresses the lack of quantitative benchmarks for assessing LLMs’ capability to evaluate AI Regulation (AIR) compliance under the EU AI Act. It combines an LLM-driven pipeline that generates 120 plausible high‑risk AI system excerpts with legal expert annotations, enabling a benchmark to compare frontier LLMs against human judgments. The first experiments show that certain models, notably Gemini 2.5 Pro, can closely approximate expert compliance judgments (κ_w ≈ 0.86, ρ ≈ 0.86) but reveal biases and variability across prompts and models. By providing an open dataset and code, AIReg-Bench establishes a foundational, extensible framework to study, compare, and improve LLM-based AIR compliance assessments.

Abstract

As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts' compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.

Paper Structure

This paper contains 29 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Cohen's $\kappa$ (quadratically weighted) scores across frontier language models, showing the level of agreement on compliance judgments (on a 1-5 Likert scale) between these models and the median legal expert in our team, taken over the entire AIReg-Bench dataset.
  • Figure 2: Illustration of the AIReg-Bench Technical Documentation Excerpt Generation Pipeline.
  • Figure 3: Heatmaps of compliance performance. The left panels show the distribution of compliance ratings as confusion matrices, comparing the median human expert with the LLMs. The right panels show the mean absolute error (MAE) across use cases and articles. Results are shown for Gemini 2.5 Pro (top) and as an average over all evaluated LLMs (bottom).
  • Figure 4: Pareto frontier of model cost versus compliance agreement (Cohen's $\kappa$). Each point represents a model, plotted by price (x-axis) and agreement with human expert ratings (y-axis). Pareto-efficient models are shown with red markers. Labels denote model names.
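
The two quantities behind Figures 1 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `quadratic_weighted_kappa` and `pareto_efficient` are hypothetical, ratings are assumed to be integers on the 1-5 Likert scale, and the model names, prices, and scores in the usage example are invented placeholders rather than results from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(expert, model, k=5):
    """Cohen's kappa with quadratic weights for integer ratings in 1..k.

    `expert` and `model` are equal-length lists of ratings (hypothetical
    inputs, e.g. the median expert label and an LLM label per sample).
    """
    if len(expert) != len(model) or not expert:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(expert)
    joint = Counter(zip(expert, model))  # observed rating pairs
    row = Counter(expert)                # marginal counts, rater 1
    col = Counter(model)                 # marginal counts, rater 2
    observed = 0.0
    expected = 0.0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            w = (i - j) ** 2 / (k - 1) ** 2
            observed += w * joint[(i, j)] / n
            expected += w * (row[i] / n) * (col[j] / n)
    if expected == 0:
        # Both raters used one identical category: kappa is undefined.
        return float("nan")
    return 1.0 - observed / expected


def pareto_efficient(models):
    """Return names of models not dominated on (lower price, higher kappa).

    `models` is a list of (name, price, kappa) tuples.
    """
    efficient = []
    for name, price, kappa in models:
        dominated = any(
            p <= price and kp >= kappa and (p < price or kp > kappa)
            for _, p, kp in models
        )
        if not dominated:
            efficient.append(name)
    return efficient


if __name__ == "__main__":
    # Placeholder ratings and models, invented for illustration only.
    print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 4]))
    print(pareto_efficient([("A", 1.0, 0.5), ("B", 2.0, 0.7), ("C", 3.0, 0.6)]))
```

Quadratic weights penalise a disagreement in proportion to the squared distance between the two ratings, so a 1-versus-5 mismatch costs far more than a 3-versus-4 mismatch, which suits an ordinal Likert scale. In the Pareto sketch, a model is kept only if no other model is at least as cheap and at least as accurate while being strictly better on one of the two axes.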