Neural Collapse in Multi-label Learning with Pick-all-label Loss
Pengyu Li, Xiao Li, Yutong Wang, Qing Qu
TL;DR
This paper extends neural collapse (NC) from single-label classification to multi-label learning under the pick-all-label (PAL) loss, revealing that last-layer features exhibit an ETF geometry for multiplicity-1 data and a scaled tag-wise average structure for higher multiplicities. Under a PAL-CE objective with an unconstrained feature model (UFM), the authors prove that all global optimizers satisfy Multi-label NC (M-lab NC), including a simplex ETF for the last-layer classifier and a precise tag-wise averaging relationship across multiplicities. The analysis shows M-lab NC holds even with imbalanced multiplicities, provided within-multiplicity balance is maintained, and it yields practical benefits: an one-nearest-neighbor (ONN) encoding for prediction and parameter-efficient training by fixing the ETF classifier and reducing feature dimensions. Empirically, M-lab NC is observed across datasets and architectures, and its geometry guides improved test performance and training efficiency, with broader implications for extreme multi-label and data-imbalanced settings.
Abstract
We study deep neural networks for the multi-label classification (MLab) task through the lens of neural collapse (NC). Previous works have been restricted to the multi-class classification setting and discovered a prevalent NC phenomenon comprising of the following properties for the last-layer features: (i) the variability of features within every class collapses to zero, (ii) the set of feature means form an equi-angular tight frame (ETF), and (iii) the last layer classifiers collapse to the feature mean upon some scaling. We generalize the study to multi-label learning, and prove for the first time that a generalized NC phenomenon holds with the "pick-all-label" formulation, which we term as MLab NC. While the ETF geometry remains consistent for features with a single label, multi-label scenarios introduce a unique combinatorial aspect we term the "tag-wise average" property, where the means of features with multiple labels are the scaled averages of means for single-label instances. Theoretically, under proper assumptions on the features, we establish that the only global optimizer of the pick-all-label cross-entropy loss satisfy the multi-label NC. In practice, we demonstrate that our findings can lead to better test performance with more efficient training techniques for MLab learning.
