Neural Collapse in Multi-label Learning with Pick-all-label Loss

Pengyu Li; Xiao Li; Yutong Wang; Qing Qu

TL;DR

This paper extends neural collapse (NC) from single-label classification to multi-label learning under the pick-all-label (PAL) loss, revealing that last-layer features exhibit an ETF geometry for multiplicity-1 data and a scaled tag-wise average structure for higher multiplicities. Under a PAL-CE objective with an unconstrained feature model (UFM), the authors prove that all global optimizers satisfy Multi-label NC (M-lab NC), including a simplex ETF for the last-layer classifier and a precise tag-wise averaging relationship across multiplicities. The analysis shows that M-lab NC holds even with imbalanced multiplicities, provided within-multiplicity balance is maintained, and it yields two practical benefits: a one-nearest-neighbor (ONN) encoding for prediction, and parameter-efficient training by fixing the classifier to an ETF and reducing the feature dimension. Empirically, M-lab NC is observed across datasets and architectures, and its geometry guides improved test performance and training efficiency, with broader implications for extreme multi-label and data-imbalanced settings.
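As a minimal sketch of the PAL-CE objective mentioned above, assuming the usual pick-all-label construction in which a sample's loss is the sum of single-label cross-entropy terms over its tags: the function names `cross_entropy` and `pal_cross_entropy` and the example logits are illustrative choices, not taken from the paper.

```python
import math


def cross_entropy(logits, k):
    """Softmax cross-entropy of `logits` against the single class index `k`."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[k]


def pal_cross_entropy(logits, tags):
    """Pick-all-label CE (assumed form): sum the single-label CE over every tag."""
    return sum(cross_entropy(logits, k) for k in tags)


# Hypothetical example: K = 3 classes, one sample tagged {0, 2} (multiplicity 2).
logits = [2.0, -1.0, 0.5]
loss = pal_cross_entropy(logits, {0, 2})
```

With uniform logits over $K$ classes each tag contributes $\log K$, so under this sketch a multiplicity-$m$ sample has loss $m \log K$.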

Abstract

We study deep neural networks for the multi-label classification (M-lab) task through the lens of neural collapse (NC). Previous works have been restricted to the multi-class classification setting and discovered a prevalent NC phenomenon comprising the following properties for the last-layer features: (i) the variability of features within every class collapses to zero, (ii) the feature means form an equi-angular tight frame (ETF), and (iii) the last-layer classifiers collapse to the feature means up to scaling. We generalize the study to multi-label learning and prove for the first time that a generalized NC phenomenon holds with the "pick-all-label" formulation, which we term M-lab NC. While the ETF geometry remains consistent for features with a single label, multi-label scenarios introduce a unique combinatorial aspect we term the "tag-wise average" property, where the means of features with multiple labels are the scaled averages of the means for single-label instances. Theoretically, under proper assumptions on the features, we establish that every global optimizer of the pick-all-label cross-entropy loss satisfies multi-label NC. In practice, we demonstrate that our findings can lead to better test performance with more efficient training techniques for M-lab learning.
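The two geometric properties in the abstract can be illustrated with a small sketch. It assumes the standard simplex ETF construction $\sqrt{K/(K-1)}\,(\boldsymbol{I} - \tfrac{1}{K}\boldsymbol{1}\boldsymbol{1}^\top)$ for the multiplicity-1 means, and it uses a placeholder `scale` argument for the tag-wise average, since the paper's exact scaling constant is not reproduced here. All function names are hypothetical.

```python
import math


def simplex_etf(K):
    """Rows are the K simplex ETF vectors in R^K: sqrt(K/(K-1)) * (I - 11^T / K)."""
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def tagwise_average(means, tags, scale=1.0):
    """Scaled average of the multiplicity-1 means of `tags`.

    `scale` is a placeholder assumption, not the constant derived in the paper.
    """
    tags = sorted(tags)
    dim = len(means[0])
    return [scale * sum(means[k][j] for k in tags) / len(tags) for j in range(dim)]


K = 3
etf = simplex_etf(K)                       # multiplicity-1 means, e.g. bird, cat, ...
h_bird_cat = tagwise_average(etf, {0, 1})  # multiplicity-2 mean for tags {0, 1}
```

In this sketch the ETF vectors have unit norm and pairwise inner product $-1/(K-1)$, and the multiplicity-2 mean correlates positively with the means of its own two tags and negatively with the remaining one.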

Paper Structure (47 sections, 10 theorems, 130 equations, 8 figures, 2 tables)

Introduction
Our contributions.
Related works on multi-label learning.
Related works on neural collapse.
Basic notations.
Paper organization.
Problem Formulation
Notations for multi-label dataset.
The "pick-all-labels" loss.
Optimization under the unconstrained feature model (UFM).
Main Results
Multi-label Neural Collapse (M-lab NC)
Remarks.
Global Optimality & Benign Landscape Under UFM
Global Optimality of M-lab NC
...and 32 more sections

Key Result

Theorem 1

In the setting of Definition 1, assume the feature dimension satisfies $d \ge K-1$, where $K$ is the number of classes, and assume the training data are balanced within each multiplicity. Then any global optimizer $(\boldsymbol{W}^\star, \boldsymbol{H}^\star, \boldsymbol{b}^\star)$ [...] where either $b^\star = 0$ or $\lambda_{\boldsymbol{b}} = 0$. Moreover, the global minimizer [...]
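For orientation, a hedged sketch of what the training loss in Definition 1 plausibly looks like, assuming the standard regularized UFM form in which the last-layer features $\boldsymbol{H} = [\boldsymbol{h}_1, \dots, \boldsymbol{h}_N]$ are free optimization variables. Only $\lambda_{\boldsymbol{b}}$ appears in the statement above; the symbols $\lambda_{\boldsymbol{W}}$, $\lambda_{\boldsymbol{H}}$, $S_i$, the $1/N$ normalization, and the factor $1/2$ are assumptions made for this illustration.

$$
\min_{\boldsymbol{W}, \boldsymbol{H}, \boldsymbol{b}} \;
\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{PAL}}\!\left(\boldsymbol{W}\boldsymbol{h}_i + \boldsymbol{b},\, \boldsymbol{y}_{S_i}\right)
+ \frac{\lambda_{\boldsymbol{W}}}{2}\,\|\boldsymbol{W}\|_F^2
+ \frac{\lambda_{\boldsymbol{H}}}{2}\,\|\boldsymbol{H}\|_F^2
+ \frac{\lambda_{\boldsymbol{b}}}{2}\,\|\boldsymbol{b}\|_2^2
$$

Here $S_i$ denotes the tag set of sample $i$, and $\mathcal{L}_{\mathrm{PAL}}$ sums the cross-entropy loss over the tags in $S_i$. Under this reading, the condition $\lambda_{\boldsymbol{b}} = 0$ in the theorem corresponds to leaving the bias unregularized.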

Figures (8)

Figure 1: An illustration of neural collapse for M-clf (top row) vs. M-lab (bottom row) learning. For illustrative purposes, we consider a simple setting with the number of classes $K = 3$. The individual panels are scatterplots showing the top two singular vectors of the last-layer features $\bm{H}$ at the beginning (left) and end (right) stages of training. The solid (resp. dashed) line segments represent the mean of the multiplicity $=1$ (resp. $=2$) features with the same labels. Panels i-iii. As training progresses, the last-layer features of samples corresponding to a single label, e.g., $\mathtt{bird}$, collapse tightly around their mean. Panels iv-vi. The analogous phenomenon holds in the multi-label setting. Panel iv. A training sample has multiplicity $=1$ (resp. $=2$) if it has one tag (resp. two tags). Panel vi. At the end stage of training, the feature mean of the multiplicity-$2$ samples $\{\mathtt{bird},\, \mathtt{cat}\}$ is a scaled tag-wise average of the feature means of its associated multiplicity-$1$ samples, i.e., $\{\mathtt{bird}\}$ and $\{\mathtt{cat}\}$.
Figure 2: M-lab NC holds with imbalanced data in higher multiplicities. (a) and (b) plot metrics that measure M-lab NC on M-lab Cifar10; (c) and (d) visualize learned features on M-lab MNIST, where one multiplicity-2 class is missing from the setup, which results in a reduced M-lab NC geometry. The ETF structure for multiplicity-1 features still holds. More experimental details are deferred to the experiments section.
Figure 3: Prevalence of M-lab NC across different network architectures on M-lab MNIST (top) and M-lab Cifar10 (bottom). From left to right, the plots show the four metrics, $\mathcal{NC}_1, \mathcal{NC}_2, \mathcal{NC}_3$, and $\mathcal{NC}_m$, for measuring M-lab NC. More details about the datasets and training setups can be found in the appendix.
Figure 4: Prevalence of M-lab NC on the M-lab SVHN dataset. We train ResNet models on the M-lab SVHN dataset (Netzer et al., 2011) for $400$ epochs and report $\mathcal{NC}_1, \mathcal{NC}_2, \mathcal{NC}_3$, and $\mathcal{NC}_m$ for measuring M-lab NC. See the appendix for more details.
Figure 5: M-lab NC phenomenon on the extreme MS-COCO dataset. All M-lab NC measures converge to small values.
...and 3 more figures

Theorems & Definitions (21)

Definition 1: Nonconvex Training Loss under UFM
Theorem 1: Global Optimality of M-lab NC
Theorem 2: Benign Optimization Landscape
proof: Proof of Theorem 1
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
...and 11 more
