Unveiling the Dynamics of Information Interplay in Supervised Learning

Kun Song, Zhiquan Tan, Bochao Zou, Huimin Ma, Weiran Huang

TL;DR

The paper introduces matrix mutual information ratio ($MIR$) and matrix entropy difference ratio ($HDR$) as information-theoretic tools to study the interplay between data representations and classifier heads in supervised learning, grounded in Neural Collapse. It derives theoretical properties of $MIR$ and $HDR$ at Neural Collapse and demonstrates their ability to describe training dynamics, linear mode connectivity, grokking, and effects of label smoothing and pruning. The authors further integrate $MIR$ and $HDR$ as auxiliary loss terms for both supervised and semi-supervised learning, achieving improvements on standard benchmarks and better utilization of unlabeled data. Overall, the work provides a new analytical framework and practical training enhancements by linking representation–head information measures with learning dynamics and generalization performance.

Abstract

In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce the matrix mutual information ratio (MIR) and the matrix entropy difference ratio (HDR) to assess the interactions between data representations and classification heads in supervised learning, and we determine the theoretical optimal values of MIR and HDR when Neural Collapse happens. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, an intriguing phenomenon observed in supervised training in which the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method's effectiveness, demonstrating that MIR and HDR not only aid in comprehending the dynamics throughout the training process but can also enhance the training procedure itself.

Paper Structure (24 sections, 10 theorems, 13 equations, 7 figures, 2 tables)

This paper contains 24 sections, 10 theorems, 13 equations, 7 figures, 2 tables.

Key Result

Theorem 4.2

Suppose Neural Collapse happens. Then $\operatorname{HDR}(\mathbf{G}(\mathbf{W}^T), \mathbf{G}(\mathbf{M})) = 0$ and $\operatorname{MIR}(\mathbf{G}(\mathbf{W}^T), \mathbf{G}(\mathbf{M})) = \frac{1}{C-1} + \frac{(C-2)\log(C-2)}{(C-1)\log(C-1)}$, where $C$ denotes the number of classes.
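The closed form in Theorem 4.2 can be checked numerically. The sketch below is illustrative only and rests on assumptions, because Definitions 3.1 to 3.4 and 4.1 are listed but not reproduced in this summary. It assumes the matrix entropy of a $d \times d$ Gram matrix $\mathbf{K}$ with unit diagonal is $-\operatorname{tr}\left(\frac{\mathbf{K}}{d}\log\frac{\mathbf{K}}{d}\right)$, that matrix mutual information is $H(\mathbf{K}_1) + H(\mathbf{K}_2) - H(\mathbf{K}_1 \odot \mathbf{K}_2)$, that MIR divides this by the smaller entropy, and that HDR is the absolute entropy difference divided by the larger entropy. It also assumes that under Neural Collapse both Gram matrices equal the simplex equiangular tight frame Gram matrix (ones on the diagonal, $-\frac{1}{C-1}$ elsewhere). All function names are hypothetical and not taken from the authors' code.

```python
import numpy as np


def matrix_entropy(K):
    """Entropy of the eigenvalues of K / d (assumed form of Definition 3.1)."""
    d = K.shape[0]
    eig = np.linalg.eigvalsh(K / d)
    eig = eig[eig > 1e-12]  # drop zero eigenvalues, using the convention 0 * log(0) = 0
    return float(-np.sum(eig * np.log(eig)))


def mir(K1, K2):
    """Assumed MIR: matrix mutual information divided by the smaller entropy."""
    h1, h2 = matrix_entropy(K1), matrix_entropy(K2)
    h12 = matrix_entropy(K1 * K2)  # elementwise (Hadamard) product
    return (h1 + h2 - h12) / min(h1, h2)


def hdr(K1, K2):
    """Assumed HDR: absolute entropy difference divided by the larger entropy."""
    h1, h2 = matrix_entropy(K1), matrix_entropy(K2)
    return abs(h1 - h2) / max(h1, h2)


def simplex_etf_gram(C):
    """Gram matrix of a simplex ETF: 1 on the diagonal, -1/(C-1) off the diagonal."""
    return (C * np.eye(C) - np.ones((C, C))) / (C - 1)


def mir_closed_form(C):
    """Right-hand side of the MIR expression in Theorem 4.2 (requires C >= 3)."""
    return 1 / (C - 1) + (C - 2) * np.log(C - 2) / ((C - 1) * np.log(C - 1))


C = 10  # illustrative class count, e.g. a 10-class dataset
G = simplex_etf_gram(C)  # stands in for both G(W^T) and G(M) under the assumption above
mir_value = mir(G, G)
hdr_value = hdr(G, G)
print(f"numerical MIR = {mir_value:.6f}, closed form = {mir_closed_form(C):.6f}")
print(f"numerical HDR = {hdr_value:.6f}")
```

Under these assumptions the two MIR values agree up to floating-point error, and HDR is zero because both Gram matrices have the same entropy. The closed form approaches 1 as $C$ grows, so the collapsed value of MIR depends on the number of classes.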

Figures (7)

  • Figure 1: Accuracy and MIR on the test set during training.
  • Figure 2: Accuracy and HDR on the test set during training.
  • Figure 3: Accuracy, MIR, and HDR of models interpolated with different weights on the test set.
  • Figure 4: Accuracy, MIR, and HDR of models interpolated with different weights on CIFAR-10 test set.
  • Figure 5: Accuracy, MIR, and HDR under different smoothness levels.
  • ...and 2 more figures

Theorems & Definitions (20)

  • Definition 3.1: Matrix entropy
  • Definition 3.2: Matrix mutual information
  • Definition 3.3: Matrix mutual information ratio (MIR)
  • Definition 3.4: Matrix entropy difference ratio (HDR)
  • Definition 4.1: Construction of similarity (Gram) matrix
  • Theorem 4.2
  • Corollary 4.3
  • Lemma 4.4
  • Lemma 4.5
  • Theorem 4.6
  • ...and 10 more