Neural Collapse in Multi-label Learning with Pick-all-label Loss

Pengyu Li, Xiao Li, Yutong Wang, Qing Qu

TL;DR

This paper extends neural collapse (NC) from single-label classification to multi-label learning under the pick-all-label (PAL) loss, revealing that last-layer features exhibit an ETF geometry for multiplicity-1 data and a scaled tag-wise average structure for higher multiplicities. Under a PAL-CE objective with an unconstrained feature model (UFM), the authors prove that all global optimizers satisfy Multi-label NC (M-lab NC), including a simplex ETF for the last-layer classifier and a precise tag-wise averaging relationship across multiplicities. The analysis shows M-lab NC holds even with imbalanced multiplicities, provided within-multiplicity balance is maintained, and it yields practical benefits: a one-nearest-neighbor (ONN) encoding for prediction and parameter-efficient training by fixing the ETF classifier and reducing feature dimensions. Empirically, M-lab NC is observed across datasets and architectures, and its geometry guides improved test performance and training efficiency, with broader implications for extreme multi-label and data-imbalanced settings.

Abstract

We study deep neural networks for the multi-label classification (MLab) task through the lens of neural collapse (NC). Previous works have been restricted to the multi-class classification setting and discovered a prevalent NC phenomenon comprising the following properties for the last-layer features: (i) the variability of features within every class collapses to zero, (ii) the set of feature means forms an equi-angular tight frame (ETF), and (iii) the last-layer classifiers collapse to the feature means up to some scaling. We generalize the study to multi-label learning and prove for the first time that a generalized NC phenomenon holds with the "pick-all-label" formulation, which we term MLab NC. While the ETF geometry remains consistent for features with a single label, multi-label scenarios introduce a unique combinatorial aspect we term the "tag-wise average" property, where the means of features with multiple labels are the scaled averages of means for single-label instances. Theoretically, under proper assumptions on the features, we establish that the only global optimizers of the pick-all-label cross-entropy loss are those satisfying the multi-label NC. In practice, we demonstrate that our findings can lead to better test performance with more efficient training techniques for MLab learning.
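
To make the "pick-all-label" formulation concrete, here is a minimal sketch, assuming the usual PAL reduction in which the loss of a multi-label sample is the sum of the single-label cross-entropy losses over each of its tags. The function names and the example logits are illustrative assumptions, not taken from the paper.

```python
import math

def cross_entropy(logits, k):
    """Softmax cross-entropy of `logits` against the single class index `k`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[k]

def pal_cross_entropy(logits, tags):
    """Assumed PAL-CE: sum the single-label CE over every tag of the sample."""
    return sum(cross_entropy(logits, k) for k in tags)

logits = [2.0, 1.0, -1.0]  # hypothetical scores for K = 3 classes
loss_single = pal_cross_entropy(logits, [0])     # multiplicity-1 sample
loss_double = pal_cross_entropy(logits, [0, 1])  # multiplicity-2 sample
```

Under this sketch a multiplicity-1 sample reduces to ordinary cross-entropy, so the single-label setting is recovered as a special case.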

Paper Structure

This paper contains 47 sections, 10 theorems, 130 equations, 8 figures, 2 tables.

Key Result

Theorem 1

In the setting of Definition 1, assume that the feature dimension satisfies $d \ge K-1$, where $K$ is the number of classes, and assume that the training data are balanced within each multiplicity, as discussed above. Then any global optimizer $\boldsymbol{W}^\star, \boldsymbol{H}^\star, \ldots$ [...] where either $b^\star = 0$ or $\lambda_{\boldsymbol{b}} = 0$. Moreover, the global minimizer [...]
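
The simplex ETF geometry that the theorem refers to can be checked numerically. The sketch below uses the standard construction $\sqrt{K/(K-1)}\,(\boldsymbol{I} - \tfrac{1}{K}\boldsymbol{1}\boldsymbol{1}^\top)$, which is an assumption about normalization rather than the paper's exact statement; the helper names are hypothetical. The $K$ vectors sum to zero and therefore span only $K-1$ dimensions, which is consistent with the condition $d \ge K-1$.

```python
import math

def simplex_etf(K):
    """Rows of sqrt(K/(K-1)) * (I - (1/K) * ones): K unit vectors in R^K."""
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

K = 4  # hypothetical number of classes
W = simplex_etf(K)

# Gram matrix: ones on the diagonal, -1/(K-1) everywhere else.
gram = [[dot(W[i], W[j]) for j in range(K)] for i in range(K)]

# Coordinate-wise sum of the K vectors (should vanish).
total = [sum(W[i][j] for i in range(K)) for j in range(K)]
```

Every pair of distinct vectors has the same inner product, $-1/(K-1)$, which is the equi-angular property.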

Figures (8)

  • Figure 1: An illustration of neural collapse for M-clf (top row) vs. M-lab (bottom row) learning. For illustrative purposes, we consider a simple setting with the number of classes $K = 3$. The individual panels are scatterplots showing the top two singular vectors of the last-layer features $\bm{H}$ at the beginning (left) and end (right) stages of training. The solid (resp. dashed) line segments represent the mean of the multiplicity $=1$ (resp. $=2$) features with the same labels. Panels i-iii. As the training progresses, the last-layer features of samples corresponding to a single label, e.g., $\mathtt{bird}$, collapse tightly around their mean. Panels iv-vi. The analogous phenomenon holds in the multi-label setting. Panel iv. A training sample has multiplicity $=1$ (resp. $=2$) if it has one tag (resp. two tags). Panel vi. At the end stage of training, the feature mean of multiplicity-$2$ $\{\mathtt{bird},\, \mathtt{cat}\}$ is a scaled tag-wise average of the feature means of its associated multiplicity-$1$ samples, i.e., $\{\mathtt{bird}\}$ and $\{\mathtt{cat}\}$.
  • Figure 2: M-lab NC holds with imbalanced data in higher multiplicities. (a) and (b) plot metrics that measure M-lab NC on M-lab Cifar10; (c) and (d) visualize learned features on M-lab MNIST, where one multiplicity-2 class is missing in the setup, which results in the reduced M-lab NC geometry. As we observe, the ETF structure for multiplicity-1 still holds. More experimental details are deferred to the experiments section.
  • Figure 3: Prevalence of M-lab NC across different network architectures on M-lab MNIST (top) and M-lab Cifar10 (bottom). From left to right, the plots show the four metrics, $\mathcal{NC}_1, \mathcal{NC}_2, \mathcal{NC}_3$, and $\mathcal{NC}_m$, for measuring M-lab NC. More details about the datasets and training setups can be found in the appendix.
  • Figure 4: Prevalence of M-lab NC on the M-lab SVHN dataset. We train ResNet models on the M-lab SVHN dataset (Netzer et al., 2011) for $400$ epochs and report $\mathcal{NC}_1, \mathcal{NC}_2, \mathcal{NC}_3$, and $\mathcal{NC}_m$ for measuring M-lab NC. See the appendix for more details.
  • Figure 5: M-lab NC phenomenon on the extreme MS-COCO dataset. All M-lab NC measures converge to small values.
  • ...and 3 more figures
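
The tag-wise average described in the Figure 1 caption can be worked through for $K = 3$. This is a minimal sketch under stated assumptions: the multiplicity-1 means are taken to be a unit-norm simplex ETF drawn in its 2-D plane, the third class name `dog` is hypothetical, and `scale` is a placeholder because this summary does not give the scaling factor.

```python
import math

# Unit-norm simplex ETF for K = 3 in its 2-D plane (angles 90, 210, 330 degrees).
names = ["bird", "cat", "dog"]  # "dog" is a hypothetical third class
angles = [math.pi / 2, 7 * math.pi / 6, 11 * math.pi / 6]
means = {n: (math.cos(a), math.sin(a)) for n, a in zip(names, angles)}

def tagwise_average(tags, scale=1.0):
    """Scaled average of the multiplicity-1 means of `tags` (scale is a placeholder)."""
    dim = len(next(iter(means.values())))
    return tuple(scale * sum(means[t][i] for t in tags) / len(tags)
                 for i in range(dim))

def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.hypot(*u) * math.hypot(*v))

h_bird_cat = tagwise_average(["bird", "cat"])
cos_to_bird = cosine(h_bird_cat, means["bird"])
cos_to_cat = cosine(h_bird_cat, means["cat"])
cos_to_dog = cosine(h_bird_cat, means["dog"])
```

In this sketch the multiplicity-2 mean for {bird, cat} sits midway between the bird and cat means (cosine 0.5 to each) and points directly away from the absent class (cosine -1), because the three ETF vectors sum to zero.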

Theorems & Definitions (21)

  • Definition 1: Nonconvex Training Loss under UFM
  • Theorem 1: Global Optimality of M-lab NC
  • Theorem 2: Benign Optimization Landscape
  • proof: Proof of Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • ...and 11 more