Learning Non-Vacuous Generalization Bounds from Optimization

Chengli Tan; Jiangshe Zhang; Junmin Liu

TL;DR

This work addresses the gap in generalization theory for deep nets by deriving a non-vacuous, algorithm-dependent bound for SGD through a fractal-geometry lens. By modeling SGD as an SDE driven by fractional Brownian motion, the authors connect the generalization gap to the Hausdorff dimension of the reachable hypothesis set, enabling a sublinear $\mathcal{O}(1/\sqrt{m})$ decay under their assumptions. They provide a practical estimation procedure for the bound, including how to estimate the Lipschitz constant, hypothesis-set diameter, and the fractal dimension via Hurst exponents of stochastic gradient noise. Empirical results on CIFAR-10/100 and ImageNet-1K demonstrate that the bound is non-vacuous and tracks the true generalization gap, with tighter predictions when increasing data and when adjusting the learning-rate-to-batch-size ratio. Overall, the approach offers a scalable, architecture-agnostic bound and suggests that optimization dynamics and fractal structure play key roles in generalization, with potential extensions to adaptive optimizers.

Abstract

A fundamental challenge in deep learning is to understand theoretically how well a deep neural network generalizes to unseen data. However, current approaches often yield generalization bounds that are either too loose to be informative about the true generalization error or valid only for compressed networks. In this study, we present a simple yet non-vacuous generalization bound from the optimization perspective. We achieve this by exploiting the fact that the hypothesis set accessed by stochastic gradient algorithms is essentially fractal-like, which allows us to derive a tighter bound on the algorithm-dependent Rademacher complexity. The main argument rests on modeling the discrete-time recursion via a continuous-time stochastic differential equation driven by fractional Brownian motion. Numerical studies demonstrate that our approach yields plausible generalization guarantees for modern neural networks such as ResNet and Vision Transformer, even when they are trained on a large-scale dataset (e.g., ImageNet-1K).
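To make the driving noise concrete, the following is a minimal sketch of sampling a one-dimensional fractional Brownian motion path. It assumes the standard covariance $\mathbb{E}[B_H(t)B_H(s)] = \tfrac{1}{2}\left(|t|^{2H} + |s|^{2H} - |t-s|^{2H}\right)$ and uses a Cholesky factorization of that covariance. The function name `fbm_path`, the grid, and the small diagonal jitter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fbm_path(n_steps, hurst, rng, t_max=1.0):
    """Sample one fBm path on a uniform grid of n_steps points in (0, t_max].

    Hypothetical helper for illustration; exact but O(n^3) in n_steps.
    """
    t = np.linspace(t_max / n_steps, t_max, n_steps)
    s, u = np.meshgrid(t, t, indexing="ij")
    # Standard fBm covariance with Hurst exponent `hurst`.
    cov = 0.5 * (s ** (2 * hurst) + u ** (2 * hurst)
                 - np.abs(s - u) ** (2 * hurst))
    # Assumption: a tiny jitter keeps the factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-12 * np.eye(n_steps))
    path = chol @ rng.standard_normal(n_steps)
    # Prepend the starting point B_H(0) = 0.
    return np.concatenate(([0.0], path))


rng = np.random.default_rng(0)
rough = fbm_path(256, 0.3, rng)   # H < 1/2: ragged path
smooth = fbm_path(256, 0.8, rng)  # H > 1/2: smoother path

# The summed squared increments shrink sharply as H grows.
rough_energy = np.sum(np.diff(rough) ** 2)
smooth_energy = np.sum(np.diff(smooth) ** 2)
```

Setting `hurst=0.5` recovers ordinary Brownian motion, whose increments are independent. The Cholesky approach is chosen here for clarity rather than speed.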

Paper Structure (16 sections, 2 theorems, 33 equations, 7 figures, 1 table)

Introduction
Preliminaries
Fractional Brownian Motion
Fractal Dimension
Non-Vacuous Generalization Bound for SGD
Problem Setup
Main Assumptions
Upper Bound
Estimation
Numerical Studies
Implementation Details
Number of Training Examples
Effects of Learning Rate and Mini-batch Size
Results on ImageNet-1K
Comparison with Existing Estimators
...and 1 more sections

Key Result

Theorem 1

Let the assumptions stated under Main Assumptions (Lipschitz continuity through regularity) hold. For any i.i.d. sample $S\in\mathcal{Z}^m$, there exists a constant $c\geq 1$ such that the following inequality holds:

Figures (7)

Figure 1: Histogram of Hurst exponents for all coordinates of ResNet-20. For each coordinate, we first generate a series of stochastic gradient noise (SGN) and then estimate its Hurst exponent. If the elements of a time series are mutually independent, for example, in the case of the Brownian motion and the Lévy flight, the corresponding Hurst exponent would be $1/2$ [embrechts2009selfsimilar]. Otherwise, it would suggest that the elements are not independent.
Figure 2: Sample paths of FBM in two-dimensional space. The colors indicate the evolution over time. The Hurst exponent $H$ corresponds to the raggedness of the sample path, with a higher value leading to a smoother motion.
Figure 3: Norm of the true gradient and the stochastic gradient as a function of training epoch, where the mini-batch size is 128.
Figure 4: Upper bound $\varrho_{\mathrm{bound}}$ and true generalization gap as a function of the number of training examples.
Figure 5: Negative correlation between the upper bound $\varrho_{\mathrm{bound}}$ and the ratio of learning rate to mini-batch size.
...and 2 more figures
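The per-coordinate Hurst estimation described in the Figure 1 caption can be sketched as follows. This is a minimal illustration that assumes a simple increment-variance scaling estimator: the noise series is accumulated into a path, and the variance of lag-$k$ differences is fitted to $k^{2H}$ on a log-log scale. The summary does not state which estimator the authors use, so the function name `hurst_from_noise` and the choice of lags are hypothetical.

```python
import numpy as np


def hurst_from_noise(noise, lags=(1, 2, 4, 8, 16)):
    """Estimate the Hurst exponent of a noise series (hypothetical helper).

    Assumes Var(path[t + k] - path[t]) scales like k ** (2 * H), where
    `path` is the cumulative sum of the centered noise.
    """
    noise = np.asarray(noise, dtype=float)
    path = np.cumsum(noise - noise.mean())
    lags = np.asarray(lags)
    variances = np.array([np.var(path[k:] - path[:-k]) for k in lags])
    # Slope of log-variance against log-lag is 2H.
    slope, _ = np.polyfit(np.log(lags), np.log(variances), 1)
    return slope / 2.0


# Sanity check on synthetic data: independent noise should give H near 1/2.
rng = np.random.default_rng(0)
iid_noise = rng.standard_normal(8192)
h_iid = hurst_from_noise(iid_noise)
```

In practice one would apply such an estimator to the recorded gradient-noise series of each coordinate and collect the estimates into a histogram.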

Theorems & Definitions (8)

Definition 1
Theorem 1
proof
Theorem 2
proof
Remark 1
Remark 2
Remark 3
