Redundancy as a Structural Information Principle for Learning and Generalization

Yuda Bi, Ying Zhu, Vince D Calhoun

TL;DR

Redundancy is reframed as a structural information principle that unifies various measures of dependence under an $f$-divergence geometry. The master definition $\mathcal{R}_f(X)=D_f(P_X\|\Pi_X)$, which compares the joint law $P_X$ with the product of its marginals $\Pi_X$, reveals that mutual information, covariance proxies, and spectral redundancy are projections of the same redundancy geometry. The paper proves the existence of an interior redundancy equilibrium $R^{*}$ that balances over-compression and over-coupling, and shows that learning systems self-organize toward this balance, with MAE experiments illustrating peak generalization near $R^{*}$. This framework connects information theory and finite-regime learning, offering a tunable quantity to guide the design of robust, generalizable representations and foundation models.

Abstract

We present a theoretical framework that extends classical information theory to finite and structured systems by redefining redundancy as a fundamental property of information organization rather than inefficiency. In this framework, redundancy is expressed as a general family of informational divergences that unifies multiple classical measures, such as mutual information, chi-squared dependence, and spectral redundancy, under a single geometric principle. This reveals that these traditional quantities are not isolated heuristics but projections of a shared redundancy geometry. The theory further predicts that redundancy is bounded both above and below, giving rise to an optimal equilibrium that balances over-compression (loss of structure) and over-coupling (collapse). While classical communication theory favors minimal redundancy for transmission efficiency, finite and structured systems, such as those underlying real-world learning, achieve maximal stability and generalization near this equilibrium. Experiments with masked autoencoders are used to illustrate and verify this principle: the model exhibits a stable redundancy level where generalization peaks. Together, these results establish redundancy as a measurable and tunable quantity that bridges the asymptotic world of communication and the finite world of learning.
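
As a rough illustration of the unified functional $\mathcal{R}_f(X)=D_f(P_X\|\Pi_X)$, the following minimal sketch evaluates it for a small discrete joint distribution. It assumes that $\Pi_X$ is the product of the marginals of $P_X$ and uses the standard $f$-divergence form $\sum_x \Pi_X(x)\,f\big(P_X(x)/\Pi_X(x)\big)$. The helper names (`product_of_marginals`, `redundancy`, `f_kl`, `f_chi2`) are hypothetical and not taken from the paper.

```python
import math
from itertools import product

def product_of_marginals(joint):
    """Build Pi_X from a joint given as {(x1, ..., xd): probability}."""
    d = len(next(iter(joint)))
    marginals = [dict() for _ in range(d)]
    for outcome, p in joint.items():
        for i, xi in enumerate(outcome):
            marginals[i][xi] = marginals[i].get(xi, 0.0) + p
    supports = [sorted(m) for m in marginals]
    return {
        outcome: math.prod(marginals[i][xi] for i, xi in enumerate(outcome))
        for outcome in product(*supports)
    }

def redundancy(joint, f):
    """Assumed form: R_f(X) = sum_x Pi(x) * f(P(x) / Pi(x))."""
    pi = product_of_marginals(joint)
    return sum(q * f(joint.get(x, 0.0) / q) for x, q in pi.items() if q > 0)

def f_kl(t):
    """Kernel t*log(t): yields KL(P || Pi), i.e. mutual information for two variables."""
    return t * math.log(t) if t > 0 else 0.0

def f_chi2(t):
    """Kernel (t-1)^2: yields the chi-squared dependence."""
    return (t - 1.0) ** 2

# Two independent fair bits versus two perfectly coupled fair bits.
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
coupled = {(0, 0): 0.5, (1, 1): 0.5}

r_indep_kl = redundancy(independent, f_kl)   # vanishes under independence
r_coupled_kl = redundancy(coupled, f_kl)     # equals log(2) nats
r_coupled_chi2 = redundancy(coupled, f_chi2)
```

Swapping the kernel `f` changes only the measure of dependence, not the comparison being made: both kernels vanish on the independent pair and are positive on the coupled pair.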

Paper Structure

This paper contains 53 sections, 14 theorems, 59 equations, 2 figures, 2 tables.

Key Result

Proposition 1

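For any convex function $f:\mathbb{R}_{>0}\to\mathbb{R}$ satisfying $f(1)=0$, the redundancy functional satisfies $\mathcal{R}_f(X)\ge 0$, with equality if and only if $P_X=\Pi_X$, that is, when the components of $X$ are independent (the "only if" direction relies on strict convexity of $f$ at $1$).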

Figures (2)

  • Figure 1: U-shaped redundancy–performance relationship. CIFAR-100 linear-probe Top-1 accuracy (vertical axis) varies non-monotonically with the spectral redundancy $\mathcal{R}_{\mathrm{spec}}$ (horizontal axis). Both over-redundant (high $\mathcal{R}_{\mathrm{spec}}$) and under-redundant (low $\mathcal{R}_{\mathrm{spec}}$) models exhibit degraded performance, while an intermediate redundancy ($\lambda_{\mathrm{red}}=10^{-2}$, $\mathcal{R}_{\mathrm{spec}}\approx 0.5$) achieves the best generalization (41.4%). The dashed line marks the fitted optimum $R^{*}$ predicted by Theorem \ref{thm:Ushape}, and shaded regions indicate under- and over-redundant regimes.
  • Figure 2: Training dynamics of effective rank and spectral redundancy across different redundancy regularization strengths $\lambda_{\mathrm{red}}$. Solid lines denote the evolution of effective rank, and dashed lines denote spectral redundancy. Moderate regularization ($\lambda_{\mathrm{red}}\!=\!10^{-2}$) yields a stable equilibrium where redundancy and rank co-stabilize, while excessive regularization ($5\times10^2$) collapses redundancy. The relationship between $\lambda_{\mathrm{red}}$ and redundancy is markedly non-linear, confirming that redundancy regulation induces a self-organizing equilibrium rather than monotonic suppression.
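
The two quantities tracked in Figure 2 can be made concrete with a small sketch. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized spectrum). The paper's exact definition of $\mathcal{R}_{\mathrm{spec}}$ is not reproduced on this page, so `spectral_redundancy_proxy` below is a purely hypothetical stand-in, chosen only so that a flat spectrum gives 0 and a collapsed spectrum approaches 1.

```python
import math

def effective_rank(eigenvalues):
    """exp(entropy) of the spectrum normalized to sum to one.

    `eigenvalues` are nonnegative, e.g. eigenvalues of a feature covariance.
    """
    total = sum(eigenvalues)
    probs = [lam / total for lam in eigenvalues if lam > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def spectral_redundancy_proxy(eigenvalues):
    """HYPOTHETICAL proxy (not the paper's definition): 1 - erank / d."""
    d = len(eigenvalues)
    return 1.0 - effective_rank(eigenvalues) / d

flat = [1.0, 1.0, 1.0, 1.0]        # all directions used equally
collapsed = [1.0, 0.0, 0.0, 0.0]   # a single dominant direction

erank_flat = effective_rank(flat)
erank_collapsed = effective_rank(collapsed)
proxy_flat = spectral_redundancy_proxy(flat)
proxy_collapsed = spectral_redundancy_proxy(collapsed)
```

Under this proxy, effective rank and redundancy move in opposite directions by construction, so it is suited only to building intuition about the two extremes, not to reproducing the curves in the figure.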

Theorems & Definitions (41)

  • Definition 1: Unified redundancy functional
  • Remark 1: Specializations and the role of the kernel $f$
  • Proposition 1: Nonnegativity and vanishing iff independence
  • Proof (sketch)
  • Proposition 2: Data processing inequality (DPI)
  • Proof (sketch)
  • Proposition 3: Bounds
  • Proposition 4: Gaussian total correlation
  • Proof (sketch)
  • Proposition 5: Quadratic approximation and covariance form
  • ...and 31 more
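
The statements of Propositions 4 and 5 are not reproduced on this page, so the following sketch relies only on textbook facts rather than the paper's exact formulas. For a Gaussian vector with covariance $\Sigma$, the total correlation is $\tfrac12\big(\sum_i \log \Sigma_{ii} - \log\det\Sigma\big)$, and for weak correlations $\rho_{ij}$ a second-order expansion gives approximately $\tfrac12\sum_{i<j}\rho_{ij}^2$. All function names are hypothetical.

```python
import math

def cholesky_logdet(a):
    """log det of a symmetric positive-definite matrix (list of lists)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    # det(A) = prod(diag(L))^2 for A = L L^T
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))

def gaussian_total_correlation(cov):
    """0.5 * (sum_i log cov_ii - log det cov), in nats."""
    diag_term = sum(math.log(cov[i][i]) for i in range(len(cov)))
    return 0.5 * (diag_term - cholesky_logdet(cov))

def quadratic_approximation(cov):
    """Second-order expansion: 0.5 * sum_{i<j} rho_ij^2."""
    n = len(cov)
    return 0.5 * sum(
        cov[i][j] ** 2 / (cov[i][i] * cov[j][j])
        for i in range(n)
        for j in range(i + 1, n)
    )

rho = 0.1
weakly_coupled = [[1.0, rho], [rho, 1.0]]
uncorrelated = [[2.0, 0.0], [0.0, 3.0]]

tc_weak = gaussian_total_correlation(weakly_coupled)
tc_approx = quadratic_approximation(weakly_coupled)
tc_zero = gaussian_total_correlation(uncorrelated)
```

In the two-variable case the exact value reduces to $-\tfrac12\log(1-\rho^2)$, which the quadratic form tracks closely when $\rho$ is small and underestimates as $|\rho|$ grows.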