Stochastic Thermodynamics of Learning Parametric Probabilistic Models

Shervin Sadat Parsi

TL;DR

This work reframes the learning of Parametric Probabilistic Models (PPMs) as a thermodynamic process, introducing Memorized Information (M-info) and Learned Information (L-info) to quantify information stored in parameters and task-aligned learning, respectively. By modeling the joint dynamics of model outputs X and parameters Θ with lagged bipartite dynamics and Local Detailed Balance, it links information flow to entropy production and identifies Θ as a high-capacity heat reservoir that stores learned information through the learned-data exchange. Using the Detailed Fluctuation Theorem, the paper connects L-info to interval entropy production and describes an ideal, quasi-static learning regime in which all memorized information is relevant and conditional entropy production vanishes, at the cost of increased computation. The framework provides a thermodynamic explanation for why over-parameterization and slow, lazy dynamics can aid generalization, while offering a principled path to diagnose and quantify information flow and energy exchange during training under both naive and more realistic reservoir models.

Abstract

We have formulated a family of machine learning problems as the time evolution of Parametric Probabilistic Models (PPMs), inherently rendering a thermodynamic process. Our primary motivation is to leverage the rich toolbox of thermodynamics of information to assess the information-theoretic content of learning a probabilistic model. We first introduce two information-theoretic metrics: Memorized-information (M-info) and Learned-information (L-info), which trace the flow of information during the learning process of PPMs. Then, we demonstrate that the accumulation of L-info during the learning process is associated with entropy production, and parameters serve as a heat reservoir in this process, capturing learned information in the form of M-info.

Paper Structure (15 sections, 42 equations, 3 figures, 1 table)

This paper contains 15 sections, 42 equations, 3 figures, 1 table.

Figures (3)

  • Figure 3.1: The learning trajectory $\mathcal{T}$ depicts the thermodynamic process that takes the initial model state to the final state. The green area shows the family of distributions accessible to the PPM. The red area accounts for the possibility that the target distribution, $p^*$, is not in this family.
  • Figure 3.2: The Bayesian network for the joint trajectory probability $P[\boldsymbol{x}_n, \boldsymbol{\theta}_n ]$, based on dual-timescale bipartite dynamics.
  • Figure 4.1: This experiment contrasts the parameter dynamics for three mini-batch sizes: $|b_t |= 1$, $|b_t |= 10$, and $|b_t| = 100$. The model is a four-layer feedforward neural network with a uniform width of 200 neurons, trained on the MNIST classification task with a vanilla SGD optimizer. The experiment was replicated over 50 trials to generate an ensemble of parameters. a) For each batch size, one randomly chosen parameter from the model's last layer is shown, with four realizations of its dynamics. b) The average accuracy (solid line) and the variance of accuracy within the ensemble (shaded area), illustrating the low-variance condition: macroscopic quantities such as accuracy have low variance across the ensemble. c) The noise variance averaged over all parameters, i.e., $\frac{1}{\dim(\theta)}\sum^{\dim(\theta)}_{i=0} C_{i,i}(t,0)$, for each mini-batch size, underscoring the stationary nature of $\eta$. This panel also highlights the role of mini-batch size in determining the noise width, i.e., the temperature of the environment. The horizontal dashed line marks the maximum absolute value observed for $\nabla_\theta~ U_B(\theta_{t_n})$, serving as a reference point for the magnitude of the noise. d) The autocorrelation of $\eta$ averaged over all parameters. For instance, at step 1000 this quantity reads: $\frac{1}{\dim(\theta)}\sum^{\dim(\theta)}_{i=0} C_{i,i}(t=1000,t'-t)$. The rapid decline in autocorrelation with time lag indicates the white-noise character of $\eta$.
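
The two ensemble statistics in panels c) and d) of Figure 4.1 can be illustrated with a minimal sketch. It assumes that $C_{i,i}(t,\tau)$ is the ensemble covariance of $\eta_i(t)$ and $\eta_i(t+\tau)$ across trials, which this summary does not state explicitly. It also replaces the real SGD noise with synthetic i.i.d. Gaussian samples, so the noise is white by construction. The names `simulate_noise` and `avg_correlation` are hypothetical helpers, not code from the paper.

```python
import random

def simulate_noise(n_trials, n_steps, n_params, sigma, seed=0):
    """Hypothetical stand-in for SGD noise: i.i.d. Gaussian samples.

    Returns a nested list indexed as eta[trial][step][param].
    """
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, sigma) for _ in range(n_params)]
             for _ in range(n_steps)]
            for _ in range(n_trials)]

def avg_correlation(eta, t, lag):
    """Estimate (1/dim) * sum_i C_ii(t, lag) over an ensemble of trials.

    Assumed definition: C_ii(t, lag) is the covariance, taken across
    trials, between eta_i at step t and eta_i at step t + lag.
    """
    n_trials = len(eta)
    n_params = len(eta[0][0])
    total = 0.0
    for i in range(n_params):
        # Ensemble means at the two time points.
        mean_a = sum(run[t][i] for run in eta) / n_trials
        mean_b = sum(run[t + lag][i] for run in eta) / n_trials
        # Ensemble covariance for parameter i.
        total += sum((run[t][i] - mean_a) * (run[t + lag][i] - mean_b)
                     for run in eta) / n_trials
    return total / n_params

# 50 trials mirrors the ensemble size in the caption; sigma is arbitrary.
eta = simulate_noise(n_trials=50, n_steps=20, n_params=200, sigma=0.5)
variance = avg_correlation(eta, t=5, lag=0)  # panel c): should be near sigma**2
lagged = avg_correlation(eta, t=5, lag=3)    # panel d): should be near zero
print(variance, lagged)
```

With real training runs, `eta` would instead hold the recorded noise term for each trial, step, and parameter. A zero-lag value that stays flat over `t` would correspond to the stationarity noted in panel c), and a fast drop toward zero as `lag` grows would correspond to the white-noise behavior in panel d).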