Large Language Models Do Multi-Label Classification Differently

Marcus Ma, Georgios Chochlakis, Niyantha Maruthu Pandiyan, Jesse Thomason, Shrikanth Narayanan

TL;DR

This work investigates how large language models handle multi-label classification, revealing that autoregressive generation produces spiky, sequential single-label predictions rather than coherent multi-label distributions. By analyzing per-step label probabilities and comparing them to human annotator distributions, the authors introduce distribution alignment as a core task and develop zero-shot and supervised methods to improve alignment and F1 scores. A key finding is that a simple max-over-generations technique substantially boosts alignment and predictive performance without extra computation. The study demonstrates the need to rethink multi-label inference with LLMs and offers practical methods to align model confidences with subjective human distributions in real-world tasks.
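To make the idea of distribution alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes the reference distribution for an example is the fraction of annotators who selected each label, and it uses mean absolute difference as a stand-in alignment measure; the summary above does not name the metric the paper uses. The function names and toy data are hypothetical.

```python
import numpy as np

def empirical_label_distribution(annotations, num_labels):
    """Fraction of annotators who selected each label.

    Assumption: each annotator provides a set of label indices. Because the
    task is multi-label, the resulting values need not sum to 1.
    """
    votes = np.zeros(num_labels)
    for selected in annotations:
        for label in set(selected):
            votes[label] += 1
    return votes / len(annotations)

def mean_absolute_gap(model_scores, reference):
    """Illustrative alignment measure (an assumption, not the paper's metric)."""
    diff = np.asarray(model_scores) - np.asarray(reference)
    return float(np.mean(np.abs(diff)))

# Toy example: four annotators, three candidate labels.
annotations = [[0, 2], [0], [0, 2], [1]]
reference = empirical_label_distribution(annotations, num_labels=3)

# A "spiky" model distribution that puts almost all mass on one label.
spiky_scores = [0.98, 0.01, 0.01]
gap = mean_absolute_gap(spiky_scores, reference)
print(reference, gap)
```

In this toy case the annotators spread their votes across labels, while the spiky scores concentrate on a single label, so the gap is large even though the top label agrees.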

Abstract

Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, focusing on subjective tasks, by analyzing the output distributions of the models at each label generation step. We find that the initial probability distribution for the first label often does not reflect the final output, even in terms of relative order, and that LLMs tend to suppress all but one label at each generation step. We further observe that as model scale increases, token distributions exhibit lower entropy and higher single-label confidence, but the internal relative ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. We introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose zero-shot and supervised methods that improve both alignment and predictive performance over existing approaches. We find that one method, taking the max probability over all label generation distributions instead of using only the initial probability distribution, improves both distribution alignment and overall F1 classification without any additional computation.
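The max-over-generations idea described in the abstract can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes the per-step label probabilities have already been collected into an array of shape (generation steps, labels), and the function names and numbers are hypothetical.

```python
import numpy as np

def first_step_scores(step_probs):
    """Baseline: use only the distribution from the first label generation step."""
    return np.asarray(step_probs)[0]

def max_over_steps_scores(step_probs):
    """For each label, take its highest probability across all generation steps."""
    return np.asarray(step_probs).max(axis=0)

# Toy per-step label probabilities for one example with three labels.
# Each row is spiky: one label dominates and the rest are suppressed.
step_probs = [
    [0.97, 0.02, 0.01],  # step 1: the model generates label 0
    [0.01, 0.04, 0.95],  # step 2: the model generates label 2
]

first = first_step_scores(step_probs)
pooled = max_over_steps_scores(step_probs)

# Rank labels from most to least confident under each scoring rule.
first_ranking = np.argsort(-first)
pooled_ranking = np.argsort(-pooled)
print(first, first_ranking)
print(pooled, pooled_ranking)
```

In this toy case the first-step distribution ranks label 1 above label 2, although label 2 is the one generated next; pooling by max recovers a high score for both generated labels. Because the per-step distributions are already produced during generation, the pooling itself is only an element-wise max, which is consistent with the abstract's claim that no additional computation is needed.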

Paper Structure

This paper contains 69 sections, 2 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Autoregressive language modeling is incompatible and interferes with multi-label classification: LLMs generate one label at a time with unrepresentative distributions misaligned from reference distributions.
  • Figure 2: Top probabilities at each generation step when the last or an intermediate label is generated. Patterns are identical between the two settings, and bigger or finetuned models have clusters closer to 100%. A single step only is shown when only up to labels were generated for all examples in a specific setting.
  • Figure 3: Second-highest probabilities at each generation step when the last or an intermediate label is generated. We also show the probability at the current step of the label that is actually predicted in the next step ($r$+1 pred), the probability at the next generation step of the second highest probability of the current step (intermediate @$r$+1), and the percentage of cases the second-highest probability label at step $r$ and the prediction at $r$+1 is the same. LLM distributions show poor relative ranking, and little distinction between the last and intermediate settings. A single step only is shown when only up to labels were generated for all examples in a specific setting.
  • Figure 4: Sorted label probabilities when generating the first label for Llama3 70B Instruct. Most distributions are spiky, with the top label having near-1 probability.
  • Figure 5: Average accuracy of the first and second label for multi-label generations based on the order in which it was generated, showing decreasing trends. Line color represents dataset and line pattern represents model size.
  • ...and 11 more figures