From Haystack to Needle: Label Space Reduction for Zero-shot Classification

Nathan Vandemoortele, Bram Steenwinckel, Femke Ongenae, Sofie Van Hoecke

TL;DR

Zero-shot classification over large label spaces is difficult for LLMs because of attention and reasoning constraints. The paper introduces Label Space Reduction (LSR), an iterative framework that ranks and prunes candidate labels so the LLM chooses among fewer, more relevant options. Across seven benchmarks, LSR improves macro-F1 over standard zero-shot baselines by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet. Because LSR requires an additional LLM call at each iteration, the authors also distill it into a probabilistic classifier for efficient inference, which is reported to remain competitive, and they outline directions for automation and extension to multi-label settings.

Abstract

We present Label Space Reduction (LSR), a novel method for improving zero-shot classification performance of Large Language Models (LLMs). LSR iteratively refines the classification label space by systematically ranking and reducing candidate classes, enabling the model to concentrate on the most relevant options. By leveraging unlabeled data with the statistical learning capabilities of data-driven models, LSR dynamically optimizes the label space representation at test time. Our experiments across seven benchmarks demonstrate that LSR improves macro-F1 scores by an average of 7.0% (up to 14.2%) with Llama-3.1-70B and 3.3% (up to 11.1%) with Claude-3.5-Sonnet compared to standard zero-shot classification baselines. To reduce the computational overhead of LSR, which requires an additional LLM call at each iteration, we propose distilling the model into a probabilistic classifier, allowing for efficient inference.
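
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `mock_llm` is a hypothetical keyword matcher standing in for a real LLM call, the classifier is a simple nearest-centroid model with softmax probabilities (the abstract does not say which classifier is used), the 2-D vectors are made-up embeddings, and a fixed top-`k` cut replaces the paper's adaptive threshold.

```python
import math
from collections import defaultdict

def mock_llm(text, candidates):
    """Hypothetical stand-in for an LLM call: return the first candidate
    label that occurs in the text, else fall back to the first candidate."""
    for label in candidates:
        if label in text:
            return label
    return candidates[0]

def fit_centroids(vectors, pseudo_labels):
    """Train the (assumed) classifier: one mean vector per pseudo-label."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(vectors, pseudo_labels):
        prev = sums.get(label, [0.0] * len(vec))
        sums[label] = [s + v for s, v in zip(prev, vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

def predict_proba(centroids, vec):
    """Class probabilities: softmax over negative squared distances."""
    scores = {lab: -sum((a - b) ** 2 for a, b in zip(vec, c))
              for lab, c in centroids.items()}
    top = max(scores.values())
    exps = {lab: math.exp(s - top) for lab, s in scores.items()}
    total = sum(exps.values())
    return {lab: e / total for lab, e in exps.items()}

def reduce_label_space(probs, all_labels, k):
    """Rank labels by probability (unseen labels get 0) and keep the top k."""
    ranked = sorted(all_labels, key=lambda lab: probs.get(lab, 0.0), reverse=True)
    return ranked[:k]

def lsr(texts, vectors, labels, llm, k=2, iterations=3):
    """Iterate: LLM pseudo-labels -> classifier -> reduced label sets."""
    candidates = [list(labels) for _ in texts]  # start from the full label set
    predictions = []
    for _ in range(iterations):
        predictions = [llm(t, c) for t, c in zip(texts, candidates)]
        centroids = fit_centroids(vectors, predictions)
        candidates = [reduce_label_space(predict_proba(centroids, v), labels, k)
                      for v in vectors]
    return predictions, candidates

# Toy data: the last text contains no label keyword, so the mock LLM guesses.
labels = ["sports", "politics", "tech"]
texts = ["sports match result", "sports league news", "tech gadget launch",
         "tech chip review", "politics vote", "election debate"]
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0),
           (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]

predictions, reduced = lsr(texts, vectors, labels, mock_llm, k=2, iterations=3)
print(predictions)
print(reduced[-1])
```

In this toy run the first pass mislabels the last text, because the mock LLM falls back to the first label of the full set. The classifier then ranks the label of its nearest neighbour first, so the reduced and reordered candidate list steers the next call to a different answer. A real LLM would not behave like the keyword matcher; the example only shows how the data flows between the two models.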

Paper Structure

This paper contains 33 sections, 11 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration of the proposed methodology. (1) The LLM categorizes the data by selecting labels from the full label set. (2) With these pseudo-labels a classifier is trained on the data's numerical representation, which generates probabilities for each class. (3) The labels are then ranked and filtered based on an adaptive threshold to form a reduced label set, which is fed back to the LLM, initiating the next iteration.
  • Figure 2: Performance comparison of LSR with varying label space sizes ($k$). When $k=\text{Full}$, the labels are ranked but not reduced. Lines show macro-F1 scores over 15 iterations, starting from zero-shot baseline predictions (Llama-3.1-70B).
  • Figure 3: Hit rates of sampling strategies across reduced label space sizes ($k$). Results show the percentage of true labels inside the reduced space using Llama-3.1-70B during the first iteration of LSR.
  • Figure 4: Performance of LSR ($k=2$) with various LLMs. Lines show macro-F1 scores over 15 iterations, starting from zero-shot baseline predictions.
  • Figure 5: Results of LSR with varying unlabeled subset sizes for training. Lines show macro-F1 scores over 15 iterations, with $k=2$ using Llama-3.1-70B.
  • ...and 1 more figure
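
The two quantities plotted in the figures above, macro-F1 and the hit rate of the reduced label space, can be computed as in the sketch below. The function names and toy inputs are illustrative assumptions, and the macro-F1 here uses the standard definition (unweighted mean of per-class F1, with F1 set to 0 for a class that has no true positives); the paper's exact evaluation code may differ.

```python
def hit_rate(true_labels, reduced_label_sets):
    """Percentage of samples whose true label is inside the reduced label set."""
    hits = sum(1 for y, cands in zip(true_labels, reduced_label_sets) if y in cands)
    return 100.0 * hits / len(true_labels)

def macro_f1(true_labels, predicted_labels, labels):
    """Unweighted mean of per-class F1 over the given label list."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(true_labels, predicted_labels))
        tp = sum(1 for y, p in pairs if y == label and p == label)
        fp = sum(1 for y, p in pairs if y != label and p == label)
        fn = sum(1 for y, p in pairs if y == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class has no TP.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy inputs (hypothetical): four samples, reduced label sets of size k=2.
y_true = ["a", "b", "c", "a"]
reduced_sets = [["a", "b"], ["a", "c"], ["c", "b"], ["b", "c"]]
print(hit_rate(y_true, reduced_sets))

y_pred = ["a", "a", "c", "a"]
print(macro_f1(y_true, y_pred, ["a", "b", "c"]))
```

Because macro-F1 weights every class equally, a rare class that is never predicted pulls the score down as much as a frequent one, which makes the metric sensitive to errors across a large label space.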