
Active Slice Discovery in Large Language Models

Minhui Zhang, Prahar Ijner, Yoav Wald, Elliot Creager

TL;DR

This work tackles the problem of discovering coherent error slices in toxicity classification by Large Language Models using an active, annotation-efficient approach. It formalizes Active Slice Discovery as an interactive loop in which an active learner queries an oracle to confirm whether specific samples belong to a slice and uses the answers to update a slice membership function. The study evaluates two representations (raw Llama-3.1-8B embeddings and Sparse Autoencoder features) and two classifiers (MLP and linear SVM) with various query strategies on the Jigsaw toxicity dataset, finding that uncertainty-based active learning achieves competitive accuracy using only 2-10% of the available slice labels. The results suggest practical value for efficient, annotation-light diagnostics of model bias and safety concerns.

Abstract

Large Language Models (LLMs) often exhibit systematic errors on specific subsets of data, known as error slices. For instance, a slice can correspond to a certain demographic, where a model does poorly in identifying toxic comments regarding that demographic. Identifying error slices is crucial to understanding and improving models, but it is also challenging. An appealing approach to reduce the amount of manual annotation required is to actively group errors that are likely to belong to the same slice, while using limited access to an annotator to verify whether the chosen samples share the same pattern of model mistake. In this paper, we formalize this approach as Active Slice Discovery and explore it empirically on a problem of discovering human-defined slices in toxicity classification. We examine the efficacy of active slice discovery under different choices of feature representations and active learning algorithms. On several slices, we find that uncertainty-based active learning algorithms are most effective, achieving competitive accuracy using 2-10% of the available slice membership information, while significantly outperforming baselines.
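
The interactive loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: a small hand-written logistic regression is swapped in for the MLP and linear SVM classifiers, a synthetic 2-D grid stands in for LLM embeddings, the hidden slice rule `x0 + x1 > 0` is invented, and the names `active_slice_discovery`, `oracle`, `seed_idx`, and `budget` are hypothetical.

```python
import math

def predict_proba(w, b, x):
    """Probability that x is in the slice under a logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, epochs=200, lr=0.5):
    """Fit logistic regression with batch gradient descent (stand-in classifier)."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y
            gw = [g + err * xj for g, xj in zip(gw, x)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def active_slice_discovery(features, oracle, seed_idx, budget, batch_size=5):
    """Query the oracle for the least-confident points until `budget` labels are spent."""
    labeled = {i: oracle(i) for i in seed_idx}
    while len(labeled) < budget:
        idx = list(labeled)
        w, b = train_logreg([features[i] for i in idx], [labeled[i] for i in idx])
        pool = [i for i in range(len(features)) if i not in labeled]
        if not pool:
            break
        # Least confidence in the binary case: predicted probability closest to 0.5.
        pool.sort(key=lambda i: abs(predict_proba(w, b, features[i]) - 0.5))
        for i in pool[:min(batch_size, budget - len(labeled))]:
            labeled[i] = oracle(i)
    idx = list(labeled)
    w, b = train_logreg([features[i] for i in idx], [labeled[i] for i in idx])
    return w, b, labeled

# Toy data (assumption): a 15x15 grid of 2-D "embeddings"; the hidden slice is x0 + x1 > 0.
features = [[(i - 7) / 7, (j - 7) / 7] for i in range(15) for j in range(15)]
truth = [int(x0 + x1 > 0) for x0, x1 in features]
oracle_calls = []

def oracle(i):
    """Simulated annotator: records each query and returns the true slice label."""
    oracle_calls.append(i)
    return truth[i]

budget = 22  # roughly 10% of the 225 points
w, b, labeled = active_slice_discovery(features, oracle, seed_idx=[0, 224], budget=budget)
accuracy = sum(int(predict_proba(w, b, x) > 0.5) == t
               for x, t in zip(features, truth)) / len(truth)
print(f"labels used: {len(labeled)}, accuracy on all points: {accuracy:.2f}")
```

The sketch retrains from scratch after every batch of queries, which keeps the loop easy to follow; the `budget` argument caps the number of oracle calls, mirroring the limited annotator access the abstract describes.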

Paper Structure

This paper contains 7 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Active Slice Discovery. Given a dataset containing target labels for each data point, but slice labels for only a limited set of data points, we pose the active learning problem of uncovering latent error slices within the data. The active learner has limited access to an oracle (e.g. a human labeler) who can confirm whether or not a specific data point belongs to the error slice.
  • Figure 2: Active learning with SVM (Least Confidence) on multiple slices. Test accuracy vs. number of labeled examples is shown for two setups: (a) raw LLM embeddings and (b) SAE representations. Note that each of the four slices is a different size, leading to a different maximum number of labeled samples.
  • Figure 3: Query strategy comparison for the "disagree" slice. Test accuracy vs. number of labeled examples is shown for various query strategies acting on (a) raw LLM embeddings; and (b) SAE representations. Confidence-based query strategies (Least Confidence, Prediction Entropy, Breaking Ties) consistently yield better performance.
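
For readers unfamiliar with the three confidence-based query strategies named in Figure 3, the following sketch shows one common textbook definition of each score over a vector of predicted class probabilities. The function names and the `select_queries` helper are hypothetical, and the paper's exact implementations may differ.

```python
import math

def least_confidence(probs):
    """1 - max probability: higher means the model is less sure of its top class."""
    return 1.0 - max(probs)

def prediction_entropy(probs):
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def breaking_ties(probs):
    """Negated margin between the two most likely classes, so higher = more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return -(top - second)

def select_queries(pool_probs, score_fn, batch_size):
    """Return the indices of the `batch_size` most uncertain pool items."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score_fn(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical predicted probabilities [not in slice, in slice] for three pool items.
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
for fn in (least_confidence, prediction_entropy, breaking_ties):
    print(fn.__name__, select_queries(pool_probs, fn, batch_size=2))
```

With only two classes, all three scores are monotone functions of the top-class probability, so apart from ties they rank a pool in the same order; they can disagree once there are three or more classes.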