Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training

Kun Song, Zhiquan Tan, Bochao Zou, Jiansheng Chen, Huimin Ma, Weiran Huang

TL;DR

The paper reframes supervised learning through matrix information theory by introducing matrix entropy $H$, the matrix mutual information ratio $\text{MIR}$, and the matrix entropy difference ratio $\text{HDR}$ to quantify interactions between data representations and classifier heads. It presents Cross-Model Alignment (CMA) loss to improve cross-modal fine-tuning and demonstrates that MIR and HDR capture neural network dynamics across Neural Collapse, linear mode connectivity, and grokking, providing both theoretical and empirical insights. The authors further show how matrix-entropy-based objectives can regularize and improve performance in both supervised and semi-supervised settings, including cross-modal few-shot tasks and SSL frameworks. Together, these contributions offer a principled toolkit for diagnosing and guiding representation-head interactions, with practical impact on cross-modal learning and label-scarce scenarios. The work advances understanding of information flow in deep nets and provides novel metrics and losses that can inform training dynamics and generalization.

Abstract

In this paper, we introduce matrix entropy as an analytical tool for studying supervised learning, investigating the information content of data representations and classification head vectors, as well as the dynamic interactions between them during the supervised learning process. Our experimental results reveal that matrix entropy effectively captures the variations in information content of data representations and classification head vectors as neural networks approach Neural Collapse during supervised training, while also serving as a robust metric for measuring similarity among data samples. Leveraging this property, we propose Cross-Model Alignment (CMA) loss to optimize the fine-tuning of pretrained models. To characterize the dynamics of neural networks nearing the Neural Collapse state, we introduce two novel metrics: the Matrix Mutual Information Ratio (MIR) and the Matrix Entropy Difference Ratio (HDR), which quantitatively assess the interactions between data representations and classification heads in supervised learning, with theoretical optimal values derived under the Neural Collapse state. Our experiments demonstrate that MIR and HDR effectively explain various phenomena in neural networks, including the dynamics of standard supervised training and linear mode connectivity. Moreover, we use MIR and HDR to analyze the dynamics of grokking, a phenomenon in supervised learning where a model unexpectedly generalizes long after it has fit the training data.

Paper Structure

This paper contains 28 sections, 7 theorems, 17 equations, 13 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.2

Given a set of representations $f = [h(x_1), h(x_2), \ldots, h(x_n)]$, if $H(\mathbf{G}(f)) = 0$, then the similarity between any two representations is $1$, i.e., all the representations are identical: $h(x_1) = h(x_2) = \ldots = h(x_n)$.
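
The zero-entropy case in Theorem 4.2 can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes, since this summary does not reproduce the definitions, that $\mathbf{G}(f)$ is the cosine-similarity Gram matrix of the representations and that matrix entropy is $H(\mathbf{K}) = -\operatorname{tr}\left(\tfrac{1}{n}\mathbf{K} \log \tfrac{1}{n}\mathbf{K}\right)$ for an $n \times n$ matrix $\mathbf{K}$ with unit diagonal. The function names `gram_matrix` and `matrix_entropy` are hypothetical.

```python
import numpy as np


def gram_matrix(reps):
    """Cosine-similarity Gram matrix of row-vector representations.

    Assumption: G(f) is built from L2-normalized representations,
    so every diagonal entry equals 1.
    """
    reps = np.asarray(reps, dtype=float)
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    return normed @ normed.T


def matrix_entropy(K):
    """Assumed definition: H(K) = -tr((K/n) log(K/n)).

    Computed from the eigenvalues of K/n, which sum to 1 when K has a
    unit diagonal; zero eigenvalues contribute nothing (0 * log 0 = 0).
    """
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]
    return float(-np.sum(eigvals * np.log(eigvals)))


# Four identical representations: all pairwise similarities are 1.
identical = np.tile([1.0, 2.0, 3.0], (4, 1))
# Four mutually orthogonal representations: all pairwise similarities are 0.
orthogonal = np.eye(4)

h_identical = matrix_entropy(gram_matrix(identical))
h_orthogonal = matrix_entropy(gram_matrix(orthogonal))
print(h_identical, h_orthogonal)
```

Under these assumptions, identical representations give an entropy that is numerically zero, which is the situation the theorem characterizes, while mutually orthogonal representations give the maximum value $\log n$ (here $\log 4$).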

Figures (13)

  • Figure 1: The calculation of matrix entropy, matrix mutual information ratio, and matrix entropy difference ratio.
  • Figure 2: Variations in model accuracy and the matrix information entropy of data representations and classifier weights during the training process on CIFAR-10 and CIFAR-100.
  • Figure 3: Relationship between accuracy, matrix entropy of data representations, and softmax temperature.
  • Figure 4: The SC (Silhouette Coefficient) and DBI (Davies-Bouldin Index) of representation extracted by models trained with different temperature coefficients.
  • Figure 5: Train models on CIFAR-100 with temperature coefficients set to 1 and 10, respectively, and visualize the test set features using t-SNE.
  • ...and 8 more figures

Theorems & Definitions (17)

  • Definition 3.1: Matrix entropy
  • Definition 3.2: Effective Rank (Roy & Vetterli, 2007)
  • Definition 3.3: Matrix mutual information
  • Definition 3.4: Matrix mutual information ratio (MIR)
  • Definition 3.5: Matrix entropy difference ratio (HDR)
  • Definition 4.1: Construction of similarity (gram) matrix
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • proof
  • ...and 7 more