Large Language Models Do Multi-Label Classification Differently
Marcus Ma, Georgios Chochlakis, Niyantha Maruthu Pandiyan, Jesse Thomason, Shrikanth Narayanan
TL;DR
This work investigates how large language models handle multi-label classification, revealing that autoregressive generation produces a sequence of low-entropy, single-label predictions rather than a coherent multi-label distribution. By analyzing per-step label probabilities and comparing them to human annotator distributions, the authors introduce distribution alignment as a core task and develop zero-shot and supervised methods to improve alignment and F1 scores. A key finding is that a simple technique, taking each label's maximum probability over all label generation steps, improves both alignment and predictive performance without extra computation. The study argues for rethinking multi-label inference with LLMs and offers practical methods to align model confidences with human label distributions in subjective tasks.
Abstract
Multi-label classification is prevalent in real-world settings, but the behavior of Large Language Models (LLMs) in this setting is understudied. We investigate how autoregressive LLMs perform multi-label classification, focusing on subjective tasks, by analyzing the output distributions of the models at each label generation step. We find that the initial probability distribution for the first label often does not reflect the eventual final output, even in the relative order of the labels, and that LLMs tend to suppress all but one label at each generation step. We further observe that as models increase in scale, their token distributions exhibit lower entropy and higher single-label confidence, but the internal relative ranking of the labels improves. Finetuning methods such as supervised finetuning and reinforcement learning amplify this phenomenon. We introduce the task of distribution alignment for multi-label settings: aligning LLM-derived label distributions with empirical distributions estimated from annotator responses in subjective tasks. We propose both zero-shot and supervised methods that improve both alignment and predictive performance over existing approaches. We find that one method, taking the maximum probability over all label generation distributions instead of using only the initial probability distribution, improves both distribution alignment and overall F1 classification without any additional computation.
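
The max-over-generations idea can be illustrated with a minimal sketch. It assumes that a probability distribution over the label set has already been extracted at each label generation step. The function names, the example labels, the probabilities, and the 0.5 threshold are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List

# One dict per label generation step, mapping label -> probability.
# How these label-level probabilities are derived from token
# probabilities is assumed to be handled elsewhere.
StepProbs = List[Dict[str, float]]


def first_step_scores(step_probs: StepProbs) -> Dict[str, float]:
    """Baseline: use only the distribution from the first generation step."""
    return dict(step_probs[0])


def max_over_steps(step_probs: StepProbs) -> Dict[str, float]:
    """Score each label by its maximum probability across all steps."""
    labels = set()
    for dist in step_probs:
        labels.update(dist)
    return {
        label: max(dist.get(label, 0.0) for dist in step_probs)
        for label in sorted(labels)
    }


def predict(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every label whose score reaches the (hypothetical) threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Made-up example: each step concentrates its mass on a single label,
# so the first step alone hides the second label.
steps = [
    {"joy": 0.90, "surprise": 0.07, "anger": 0.03},
    {"joy": 0.05, "surprise": 0.85, "anger": 0.10},
]

baseline = predict(first_step_scores(steps))
pooled = predict(max_over_steps(steps))
```

In this made-up example the first-step baseline keeps only "joy", while the pooled scores keep both "joy" and "surprise". The pooling reuses distributions that are already produced during generation, which is consistent with the claim that it requires no additional computation.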
