Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs

Alexander von Recum; Christoph Schnabl; Gabor Hollbeck; Silas Alberti; Philip Blinde; Marvin von Hagen

TL;DR

The paper addresses the need to audit and analyze refusals in LLMs by introducing a unified 16-category refusal taxonomy (with 992 leaves) and large-scale datasets (8,650 real, 501 multi-annotator, and ~7.17M synthetic refusals). It formalizes refusal identification and classification, and develops two classifiers trained on synthetic data to predict refusal likelihood and refusal category: a BERT-based classifier and a logistic regression over NV-Embed-V2 embeddings. Empirical results show moderate agreement among human annotators, LLM agreement with humans that varies by model but is generally competitive, and a cost-effective embedding-based classifier that performs comparably to state-of-the-art LLMs for large-scale refusal auditing. The work provides public resources for auditing, supporting safer and more reliable IFT/RLHF datasets and informing policy for LLM refusals at scale.

Abstract

Refusals, instances where large language models (LLMs) decline or fail to fully execute user instructions, are crucial for both AI safety and AI capabilities, in particular the reduction of hallucinations. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate: they often focus solely on should-not-related categories rather than cannot-related ones, and they lack tools for auditing refusal content in black-box LLM outputs. We present a comprehensive framework for classifying LLM refusals: (a) a taxonomy of 16 refusal categories, (b) a human-annotated dataset of over 8,600 instances from publicly available IFT and RLHF datasets, (c) a synthetic dataset with 8,000 examples for each refusal category, and (d) classifiers trained for refusal classification. Our work enables precise auditing of refusal behaviors in black-box LLMs and automatic analyses of refusal patterns in large IFT and RLHF datasets. This facilitates the strategic adjustment of LLM refusals, contributing to the development of safer and more reliable LLMs.
