Probability Distribution Learning and Its Application in Deep Learning
Binchuan Qi, Wei Gong, Li Li
TL;DR
This work introduces the probability distribution (PD) learning framework, reframing deep learning as conditional distribution learning of $q_{\mathcal{Y}|\mathcal{X}}$ and proving that any suitable loss is equivalent to the Fenchel-Young loss. It generalizes strong convexity and Lipschitz smoothness to $\mathcal{H}(\psi)$-convexity and $\mathcal{H}(\Psi)$-smoothness, offering theoretical guarantees for SGD optimization in non-convex settings and deriving model-independent bounds on risk and generalization. The paper also shows how prior knowledge can be incorporated via a convex constraint set $C$ to shape the loss through Legendre-Fenchel duality, produces bounds illustrating the roles of data size, regularization, mutual information, and information loss due to irreversibility, and provides empirical validation across standard datasets. Overall, PD learning yields a unified, distribution-centric theory that links optimization dynamics to generalization behavior in deep networks, supported by both theory and experiments.
Abstract
Despite its empirical success, deep learning still lacks a comprehensive theoretical understanding of model fitting and generalization. This paper proposes the probability distribution (PD) learning framework to analyze the optimization and generalization mechanisms of deep learning. Within this framework, the conditional distribution of labels given features is the primary learning target, and the loss function, prior knowledge, and model properties are explicitly characterized. Under these formulations, we establish theoretical guarantees on optimizability, even in non-convex settings, and derive generalization error bounds that provide meaningful explanations for practical performance. Specifically, we first prove that the Fenchel-Young loss is the natural and necessary choice for solving PD learning problems, thereby justifying the generality of conclusions based on this loss. Second, to capture the characteristics of deep neural networks (DNNs), we introduce the notions of $\mathcal{H}(\psi)$-convexity and $\mathcal{H}(\Psi)$-smoothness, which generalize the classical concepts of strong convexity and Lipschitz smoothness. Building on these notions, we provide a theoretical explanation for the effectiveness of SGD in training DNNs. Finally, we derive model-independent bounds on the expected risk and generalization error of trained models, revealing how both quantities depend on the training set size, the regularization term, the mutual information between labels and features, and the information loss caused by model irreversibility. Based on our theoretical analysis and experimental validation, we believe that the PD learning framework facilitates a deeper and more unified theoretical understanding of deep learning.
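For readers unfamiliar with the Fenchel-Young loss mentioned above, the following is a minimal sketch of its standard form from the literature, $L_\Omega(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$, where $\Omega^*$ is the convex conjugate of a regularizer $\Omega$. The notation and the choice of $\Omega$ as the negative Shannon entropy on the probability simplex are assumptions made for illustration and may differ from the paper's own definitions. Under that choice, $\Omega^*$ is the log-sum-exp function and the loss coincides with the softmax cross-entropy for one-hot labels. All function names are hypothetical.

```python
import math

def logsumexp(theta):
    # Conjugate of the negative Shannon entropy restricted to the simplex.
    # Subtracting the maximum keeps the exponentials numerically stable.
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))

def neg_entropy(p):
    # Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0.
    return sum(pi * math.log(pi) for pi in p if pi > 0)

def fenchel_young_loss(theta, y):
    # L(theta; y) = Omega*(theta) + Omega(y) - <theta, y>
    inner = sum(t * yi for t, yi in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - inner

def cross_entropy(theta, y):
    # Softmax cross-entropy: -sum_i y_i * log softmax(theta)_i
    lse = logsumexp(theta)
    return -sum(yi * (t - lse) for t, yi in zip(theta, y))

# Illustrative scores and a one-hot label (assumed values, not from the paper).
theta = [2.0, -1.0, 0.5]
y_onehot = [0.0, 1.0, 0.0]
fy = fenchel_young_loss(theta, y_onehot)
ce = cross_entropy(theta, y_onehot)

# A soft target whose log-probabilities are used as scores gives zero loss.
y_soft = [0.2, 0.3, 0.5]
fy_zero = fenchel_young_loss([math.log(p) for p in y_soft], y_soft)
```

In this sketch `fy` and `ce` agree because the entropy of a one-hot label is zero; for a soft label the two differ by the label's entropy, and the Fenchel-Young loss reduces to the KL divergence between the label and the softmax prediction.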
