CALM: Curiosity-Driven Auditing for Large Language Models

Xiang Zheng; Longxiang Wang; Yi Liu; Xingjun Ma; Chao Shen; Cong Wang

TL;DR

This work tackles the challenge of auditing black-box LLMs by proposing CALM, a curiosity-driven auditing framework that finetunes an audit LLM through intrinsically motivated reinforcement learning (RL). The method introduces a regularized auditing objective that combines extrinsic safety rewards with a token-level intrinsic bonus derived from policy cover, which encourages the audit LLM to explore novel prompts. Two auditing tasks, inverse suffix generation and toxic completion, demonstrate CALM's ability to reveal derogatory or toxic outputs across multiple target models, with CALM typically converging faster and achieving higher coverage than the baselines. The results underscore the importance of intrinsic motivation for discovering rare but unsafe behaviors, and they point to a scalable approach for auditing cloud-based LLM services and strengthening safety governance.
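To make the idea of combining an extrinsic reward with a token-level intrinsic bonus concrete, here is a minimal sketch. It is not the paper's method: a simple count-based novelty bonus stands in for the policy-cover-based bonus, and the function names, the 1/sqrt(n) form, and the choice to add the extrinsic reward on the final token are all assumptions made for illustration. Only the role of the intrinsic coefficient $\lambda$ as a weight on the intrinsic term is taken from the summary and figure captions.

```python
from collections import Counter
from typing import List, Sequence


def count_based_bonus(tokens: Sequence[str], visit_counts: Counter) -> List[float]:
    """Hypothetical per-token novelty bonus: 1/sqrt(n), n = updated visit count.

    This is a stand-in for a policy-cover-based bonus, not the paper's estimator.
    """
    bonuses = []
    for tok in tokens:
        visit_counts[tok] += 1  # record the visit before scoring it
        bonuses.append(visit_counts[tok] ** -0.5)
    return bonuses


def blended_rewards(
    tokens: Sequence[str],
    extrinsic_reward: float,
    visit_counts: Counter,
    lam: float = 10.0,
) -> List[float]:
    """Token-level rewards: lam * bonus per token, plus the extrinsic
    reward on the last token (an assumed credit-assignment convention)."""
    rewards = [lam * b for b in count_based_bonus(tokens, visit_counts)]
    if rewards:
        rewards[-1] += extrinsic_reward
    return rewards


counts = Counter()
prompt = ["tell", "me", "about"]
first = blended_rewards(prompt, 0.0, counts, lam=10.0)
second = blended_rewards(prompt, 0.0, counts, lam=10.0)
print(first, second)
```

Repeating the same prompt lowers its bonus on the second call, so a policy trained on these rewards has an incentive to try prompts it has not produced before, even while the extrinsic reward is still zero.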

Abstract

Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs, where we have access only to the provided service and not to the model parameters. We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM answers with a toxic output, or an input that induces a hallucinated response from the target LLM that mentions politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.

Paper Structure (32 sections, 7 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Algorithmic auditing.
LLM-assisted red teaming.
Preliminaries
Interaction with the target LLM.
Reinforcement fine-tuning of the audit LLM.
Curiosity-Driven Auditing
Problems of previous auditing method.
Our approach.
Regularized Auditing Objective
Selection of auditing objectives.
Token-Level Intrinsic Bonus
Experiments
Experiments Setup
...and 17 more sections

Figures (4)

Figure 1: Performance in the inverse suffix generation task with the intrinsic coefficient $\lambda=10$.
Figure 2: L0 norm of the NameSet coverage in the inverse suffix generation task with the intrinsic coefficient $\lambda=10$.
Figure 3: Ablation study on the intrinsic coefficient in the inverse suffix generation task with $\lambda=100$.
Figure 4: Performance in the toxic completion task with the intrinsic coefficient $\lambda=10$.
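The Figure 2 caption reports an L0 norm of NameSet coverage. The L0 norm of a vector is the number of its nonzero entries, so one plausible reading is "how many distinct names from the set have been elicited at least once". The sketch below illustrates that reading only; the exact coverage definition is not given here, and the function name, the case-insensitive substring matching, and the example names are all hypothetical.

```python
from typing import Iterable, List


def nameset_coverage_l0(outputs: Iterable[str], name_set: List[str]) -> int:
    """Count how many names in name_set appear in at least one output.

    Builds a per-name hit-count vector and returns its L0 norm
    (the number of nonzero entries). Substring matching is an assumption.
    """
    hits = [0] * len(name_set)
    for text in outputs:
        lowered = text.lower()
        for i, name in enumerate(name_set):
            if name.lower() in lowered:
                hits[i] += 1
    return sum(1 for h in hits if h != 0)


# Hypothetical placeholder names and target-LLM outputs.
names = ["Alice Example", "Bob Sample", "Carol Test"]
outputs = [
    "Yesterday Alice Example said something.",
    "alice example again",
    "A note from Bob Sample.",
]
coverage = nameset_coverage_l0(outputs, names)
print(coverage)
```

Because repeated hits on the same name do not increase the count, this measure rewards breadth of discovery rather than how often any single name is elicited.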
