
Evaluating Human-AI Safety: A Framework for Measuring Harmful Capability Uplift

Michelle Vaccaro, Jaeyoon Song, Abdullah Almaatouq, Michiel A. Bakker

Abstract

Current frontier AI safety evaluations emphasize static benchmarks, third-party annotations, and red-teaming. In this position paper, we argue that AI safety research should focus on human-centered evaluations that measure harmful capability uplift: the marginal increase in a user's ability to cause harm with a frontier model beyond what conventional tools already enable. We frame harmful capability uplift as a core AI safety metric, ground it in prior social science research, and provide concrete methodological guidance for systematic measurement. We conclude with actionable steps for developers, researchers, funders, and regulators to make harmful capability uplift evaluation a standard practice.

Figures (3)

  • Figure 1: Approaches to AI Safety Evaluation. (Left) Current evaluations focus on isolated AI model outputs using static benchmarks, with human judges occasionally assessing the output from external observation points. (Right) Our proposed approach evaluates the human-AI system, measuring what malicious tasks a human-AI combination can accomplish using the harmful capability uplift metric.
  • Figure 2: The Harmful Capability Uplift Framework. (Left) The harmful capability uplift metric $U$ quantifies how much frontier models amplify the ability of people to perform malicious tasks. (Middle) Novel capability acquisition occurs when AI assistance enables previously impossible tasks. (Right) Hypothetical biosecurity analysis demonstrates how harmful capability uplift can vary across task dimensions.
  • Figure 3: The Proxy Task Challenge. (Left) Direct assessment of harmful capabilities raises ethical concerns, necessitating safe proxy tasks. (Middle) A formal task similarity framework quantifies the predictive relationship between proxy and target tasks. (Right) Empirical validation studies establish when proxy task performance reliably predicts capabilities on target tasks of concern, enabling evidence-based safety assessment without conducting harmful experiments.