Information-Theoretic Perspectives on Optimizers

Zhiquan Tan, Weiran Huang

TL;DR

The paper studies why some optimizers work better on particular neural architectures and introduces the entropy gap, defined as $\log n - \operatorname{H}_1(\mathbf{H})$ where $\operatorname{H}_1(\mathbf{H})$ is the matrix entropy of the Hessian $\mathbf{H}$, as an information-theoretic metric to complement sharpness. It develops an information-dynamics framework and PAC-Bayes bounds, showing that both the entropy gap and Hessian-based sharpness shape training dynamics and generalization, with a posterior approximation $P(\mu \mid D) \approx \mathcal{N}\left(\mu^*, \left(\mathbf{H} + \frac{1}{\sigma^2}\mathbf{I}\right)^{-1}\right)$. For the Lion optimizer, the authors model updates through an information bottleneck lens and derive a tanh-based representation $T = \tanh\left(\frac{\mu}{\sigma^2} x\right)$, suggesting practical improvements via more compact information transfer. Overall, the work combines information theory with optimization to guide the design of faster, more generalizable optimizers, and highlights applicability to architectures such as ResNet and ViT.

Abstract

The interplay between optimizers and neural network architectures is complicated, and it is hard to understand why some optimizers work better on specific architectures. In this paper, we find that the traditionally used sharpness metric does not fully explain this interplay, and we introduce an information-theoretic metric called the entropy gap to aid the analysis. We find that both sharpness and the entropy gap affect performance, including optimization dynamics and generalization. We further use information-theoretic tools to understand a recently proposed optimizer called Lion and find ways to improve it.

Paper Structure

This paper contains 10 sections, 10 theorems, 88 equations, 5 tables, 1 algorithm.

Key Result

Theorem 2.2

Suppose the Hessian matrix $\mathbf{H} \in \mathbb{R}^{n \times n}$ is positive definite with condition number $k$. Then the following bound holds: $\log n - \operatorname{H}_1(\mathbf{H}) \leq \frac{k}{k+n-1} \log k - \log \frac{n+k-1}{n} \leq \log k$.
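
The quantities in Theorem 2.2 can be evaluated numerically. The sketch below is illustrative only and rests on an assumption: Definition 2.1 is not reproduced on this page, so $\operatorname{H}_1$ is taken here to be the Shannon entropy of the eigenvalues after normalizing them to sum to 1. The function names `entropy_gap` and `theorem_2_2_bounds` and the example spectrum are hypothetical, not taken from the paper.

```python
import math


def entropy_gap(eigenvalues):
    """Return log n - H_1 for a positive definite matrix, given its eigenvalues.

    Assumption (not confirmed by the source): H_1 is the Shannon entropy
    of the eigenvalues after normalizing them to sum to 1.
    """
    n = len(eigenvalues)
    total = sum(eigenvalues)
    probs = [lam / total for lam in eigenvalues]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.log(n) - entropy


def theorem_2_2_bounds(n, k):
    """Return the middle and outer right-hand expressions of Theorem 2.2."""
    middle = k / (k + n - 1) * math.log(k) - math.log((n + k - 1) / n)
    outer = math.log(k)
    return middle, outer


# Hypothetical spectrum: one eigenvalue equal to k, the other n - 1 equal to 1,
# so the condition number is exactly k.
n, k = 4, 2.0
spectrum = [k] + [1.0] * (n - 1)

gap = entropy_gap(spectrum)
middle, outer = theorem_2_2_bounds(n, k)
print(f"gap={gap:.6f}  middle={middle:.6f}  outer={outer:.6f}")
```

Under the assumed definition, this particular spectrum makes the entropy gap coincide with the middle expression of the theorem, and both sit below $\log k$. A perfectly conditioned spectrum (all eigenvalues equal, $k = 1$) gives a gap of zero.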

Theorems & Definitions (22)

  • Definition 2.1: $\alpha$-order matrix entropy
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • proof
  • Theorem 2.4: Information-Theoretic PAC-Bayes Generalization Bound
  • proof
  • Lemma 2.5
  • proof
  • Lemma 2.6
  • ...and 12 more