Probability Distribution Learning and Its Application in Deep Learning

Binchuan Qi; Wei Gong; Li Li

TL;DR

This work introduces the probability distribution (PD) learning framework, reframing deep learning as conditional distribution learning of $q_{\mathcal{Y}|\mathcal{X}}$ and proving that any suitable loss is equivalent to the Fenchel-Young loss. It generalizes strong convexity and Lipschitz smoothness to $\mathcal{H}(\psi)$-convexity and $\mathcal{H}(\Psi)$-smoothness, offering theoretical guarantees for SGD optimization in non-convex settings and deriving model-independent bounds on risk and generalization. The paper also shows how prior knowledge can be incorporated via a convex constraint set $C$ to shape the loss through Legendre-Fenchel duality, produces bounds illustrating the roles of data size, regularization, mutual information, and information loss due to irreversibility, and provides empirical validation across standard datasets. Overall, PD learning yields a unified, distribution-centric theory that links optimization dynamics to generalization behavior in deep networks, supported by both theory and experiments.

Abstract

Despite its empirical success, deep learning still lacks a comprehensive theoretical understanding of model fitting and generalization. This paper proposes the probability distribution (PD) learning framework to analyze the optimization and generalization mechanisms of deep learning. Within this framework, the conditional distribution of labels given features is the primary learning target, and the loss function, prior knowledge, and model properties are explicitly characterized. Under these formulations, we establish theoretical guarantees on optimizability, even in non-convex settings, and derive generalization error bounds that provide meaningful explanations for practical performance. Specifically, we first prove that the Fenchel-Young loss is the natural and necessary choice for solving PD learning problems, thereby justifying the generality of conclusions based on this loss. Second, to capture the characteristics of deep neural networks (DNNs), we introduce the notions of $\mathcal{H}(\psi)$-convexity and $\mathcal{H}(\Psi)$-smoothness, which generalize the classical concepts of strong convexity and Lipschitz smoothness. Based on these notions, we provide a theoretical explanation for the effectiveness of SGD in training DNNs. Finally, we derive model-independent bounds on the expected risk and generalization error of trained models, revealing how the training set size, the regularization term, the mutual information between labels and features, and the information loss caused by model irreversibility affect risk and generalization. Based on our theoretical analysis and experimental validation, we believe that the PD learning framework facilitates a deeper and more unified theoretical understanding of deep learning.
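To make the Fenchel-Young loss named in the abstract concrete, here is a minimal sketch under one stated assumption: the generating function is the negative Shannon entropy on the probability simplex, whose conjugate is log-sum-exp. This choice is illustrative and is not taken from the paper. All function names are hypothetical. With this choice the standard definition $L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$ reduces to the KL divergence between the target $y$ and $\mathrm{softmax}(\theta)$, and to the usual cross-entropy when $y$ is one-hot.

```python
import math


def logsumexp(theta):
    """Conjugate of negative Shannon entropy on the simplex (numerically stable)."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def neg_entropy(p):
    """Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi) for pi in p if pi > 0.0)


def fenchel_young_loss(theta, y):
    """L(theta; y) = Omega*(theta) + Omega(y) - <theta, y> (assumed Omega: neg. entropy)."""
    inner = sum(t * yi for t, yi in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - inner


def kl_to_softmax(theta, y):
    """KL(y || softmax(theta)), computed directly for comparison."""
    lse = logsumexp(theta)
    return sum(
        yi * (math.log(yi) - (t - lse)) for t, yi in zip(theta, y) if yi > 0.0
    )


if __name__ == "__main__":
    theta = [2.0, -1.0, 0.5]   # arbitrary illustrative scores
    y = [0.7, 0.2, 0.1]        # arbitrary illustrative target distribution
    print(fenchel_young_loss(theta, y), kl_to_softmax(theta, y))
```

The two printed values should agree up to floating-point error, and the loss vanishes when $\theta = \log y$, that is, when the model's softmax output matches the target distribution.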

Paper Structure (34 sections, 15 theorems, 78 equations, 6 figures)

Introduction
Related work
Preliminaries
Notation
Lemmas
Setting and basics
Framework of PD learning
Problem formulation and framework definition
Definitions of key terms in PD learning
Loss function and prior knowledge in PD learning
Unified form of the loss function
Loss function design with incorporated prior knowledge
Optimization of PD learning
Extended convexity and extended smoothness
Convergence of SGD under Extended Smoothness
...and 19 more sections

Key Result

Lemma 1

It is well known that when $\Phi = \Psi + I_C$, the conjugate is $\Phi^*(x) = \inf_{x' \in \mathbb{R}^d} \{\sigma_C(x') + \Psi^*(x - x')\}$, where we used $I_C^*(x) = \sigma_C(x) = \max_{y \in C} \langle x, y \rangle$ [Todd2003ConvexAA, Beck2012SmoothingAF].
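The identity in Lemma 1 can be checked numerically on a small instance. The sketch below assumes a one-dimensional setting with $\Psi(y) = \tfrac{1}{2} y^2$ and $C = [-1, 1]$. These choices are illustrative and not taken from the paper, and the function names are hypothetical. In this case $\Psi^*(z) = \tfrac{1}{2} z^2$ and $\sigma_C(x') = |x'|$, and both sides of the identity equal the Huber function, which serves as a closed-form reference.

```python
def huber(x):
    """Closed form of Phi*(x) for Psi(y) = y^2 / 2 and C = [-1, 1]."""
    return 0.5 * x * x if abs(x) <= 1.0 else abs(x) - 0.5


def conj_direct(x, n=2000):
    """Phi*(x) = max_{y in C} x*y - y^2 / 2, approximated on a grid over C."""
    ys = (-1.0 + 2.0 * i / n for i in range(n + 1))
    return max(x * y - 0.5 * y * y for y in ys)


def conj_infconv(x, radius=5.0, n=5000):
    """inf_{x'} sigma_C(x') + Psi*(x - x'), approximated on a grid over [-radius, radius]."""
    h = radius / n
    xs = ((i - n) * h for i in range(2 * n + 1))
    return min(abs(xp) + 0.5 * (x - xp) ** 2 for xp in xs)


if __name__ == "__main__":
    for x in (-3.0, -0.4, 0.0, 0.7, 2.5):
        print(x, conj_direct(x), conj_infconv(x), huber(x))
```

For each test point the three columns should agree up to grid resolution. The search radius must be large enough to contain the minimizer $x' = x - \mathrm{clip}(x, -1, 1)$, which holds for the points used here.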

Figures (6)

Figure 1: Schematic illustration of the PD learning framework. This diagram presents the complete information processing pipeline from input data to latent distribution approximation, highlighting the relationships among model prediction, conjugate transformation, and distributional distance measurement.
Figure 2: Model architectures and configuration parameters. The gray blocks in the diagram indicate components whose number of iterations can be adjusted via the parameter $k$. Model B and Model D are variants of Model A and Model C, respectively, incorporating skip connections. The symbol $I$ denotes an identity function transformation.
Figure 3: Training dynamics of the risk along with its upper and lower bounds, and the local Pearson correlation coefficients between them. The red and orange curves represent the local Pearson correlation coefficients, plotted against the right $y$-axis with range $[-1, 1]$. The risk and its bounds are shown on the left $y$-axis.
Figure 4: Training dynamics of the gradient energy and extreme eigenvalues.
Figure 5: Training dynamics of the local Pearson correlation coefficient on CIFAR-10 and CIFAR-100. In each subplot, the horizontal axis represents the epoch index, with values ranging from 0 to 40, while the vertical axis denotes the local Pearson correlation coefficient.
...and 1 more figures

Theorems & Definitions (19)

Lemma 1
Lemma 2: Euler's theorem for homogeneous functions
Lemma 3
Definition 1: PD learning
Proposition 1: Uniqueness of Fenchel-Young Representation
Proposition 2: Decomposition of Risk in PD learning
Definition 2: $L_2$-norm power function
Lemma 4
Definition 3
Lemma 5
...and 9 more
