Reevaluating Theoretical Analysis Methods for Optimization in Deep Learning

Hoang Tran, Qinzi Zhang, Ashok Cutkosky

TL;DR

The paper tackles the gap between theoretical optimization analyses and practice in deep learning by introducing empirical metrics that compare real optimization trajectories to analytic predictions, focusing on low-level identities rather than global assumptions. It defines and validates proxies such as the instantaneous convexity gap ($\mathrm{inst\_gap}$), the convexity ratio, the update correlation ($\mathrm{update\_corr}$), and smoothness measures, and evaluates them on convex problems, CNNs for image classification, and LLM pretraining. The key findings are that the objective is, on average, locally convex along the training path most of the time, which makes some convex-optimization reasoning applicable, whereas the objective is not globally smooth and conventional smoothness-based analyses often fail to predict progress; a new update-correlation perspective reveals more nuanced dynamics. The work advocates empirical verification of optimization analyses and calls for new theoretical tools aligned with practical training dynamics, while acknowledging the limitation of covering only a small set of optimizers. This combination of diagnostic metrics and broad empirical testing provides a pathway toward more practically relevant optimization theory.

Abstract

There is a significant gap between our theoretical understanding of optimization algorithms used in deep learning and their practical performance. Theoretical development usually focuses on proving convergence guarantees under a variety of different assumptions, which are themselves often chosen based on a rough combination of intuitive match to practice and analytical convenience. In this paper, we carefully measure the degree to which the standard optimization analyses are capable of explaining modern algorithms. To do this, we develop new empirical metrics that compare real optimization behavior with analytically predicted behavior. Our investigation is notable for its tight integration with modern optimization analysis: rather than simply checking high-level assumptions made in the analysis (e.g. smoothness), we also verify key low-level identities used by the analysis to explain optimization behavior that might hold even if the high-level motivating assumptions do not. Notably, we find that smoothness-based analyses fail in practice under most scenarios, but the key identities commonly used in convex-optimization analyses often hold in practice despite the objective's global non-convexity.

Paper Structure (27 sections, 15 equations, 19 figures, 1 table)

Figures (19)

  • Figure 1: Instantaneous convexity gap w.r.t. $\mathbf{x}_{t-1}$ of GD on squared loss (left) and logistic regression with OpenML datasets [OpenML2013] (right).
  • Figure 2: Average and exponential average convexity gap w.r.t. $\mathbf{y}_t=\mathbf{x}_{t-1}$. A negative value implies that the objective is locally convex on average.
  • Figure 3: Convexity ratios of deep learning benchmarks where $\mathbf{x}^\star$ is the final iterate from the same training run. A convexity ratio greater than 1 indicates a convex function. Ratios between 0 and 1 suggest "weak quasi-convexity". Ratios less than 0 denote strong non-convexity. See a complementary result where $\mathbf{x}^\star$ is from a different training run in Appendix \ref{app:convex-ratio}.
  • Figure 4: Smoothness measures w.r.t. $\mathbf{x}_{t-1}$. Experiments use the optimal learning rate scheduler; see a complementary result that uses constant learning rates in Appendix \ref{app:smoothness}.
  • Figure 5: Sharpness vs. smoothness measures.
  • ...and 14 more figures
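
The convexity metrics named in the captions above can be illustrated on a toy problem. The sketch below is not the authors' code: it assumes the instantaneous convexity gap is the first-order convexity inequality $f(\mathbf{x}_t) - f(\mathbf{y}_t) - \langle \nabla f(\mathbf{x}_t), \mathbf{x}_t - \mathbf{y}_t \rangle$ with $\mathbf{y}_t = \mathbf{x}_{t-1}$, and that the convexity ratio is $\sum_t \langle \nabla f(\mathbf{x}_t), \mathbf{x}_t - \mathbf{x}^\star \rangle / \sum_t (f(\mathbf{x}_t) - f(\mathbf{x}^\star))$. Both forms are consistent with the sign conventions in the captions (negative gap and ratio of at least 1 for convex objectives), but the paper's exact definitions may differ. The quadratic objective, step size, and helper names (`inst_gap`, `convexity_ratio`, `run_gd`) are hypothetical.

```python
import numpy as np

# Toy convex objective (an assumption for illustration): f(x) = 0.5 * x^T A x
A = np.diag([1.0, 10.0])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def inst_gap(f, grad, x, y):
    """Assumed gap: f(x) - f(y) - <grad f(x), x - y>; non-positive if f is convex."""
    return f(x) - f(y) - grad(x) @ (x - y)

def convexity_ratio(f, grad, xs, x_star):
    """Assumed ratio: sum <grad f(x_t), x_t - x*> / sum (f(x_t) - f(x*))."""
    num = sum(grad(x) @ (x - x_star) for x in xs)
    den = sum(f(x) - f(x_star) for x in xs)
    return num / den

def run_gd(x0, lr, steps):
    """Plain gradient descent, returning every iterate including x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs

xs = run_gd(np.array([1.0, 1.0]), lr=0.05, steps=50)

# Gap at x_t measured against the reference point y_t = x_{t-1}
gaps = [inst_gap(f, grad, xs[t], xs[t - 1]) for t in range(1, len(xs))]

# Ratio with x* taken as the final iterate of the same run, as in Figure 3
ratio = convexity_ratio(f, grad, xs[:-1], xs[-1])

print("max gap:", max(gaps))
print("convexity ratio:", ratio)
```

On this convex quadratic every gap is non-positive and the ratio is at least 1. For a neural network, `f` and `grad` would be replaced by the loss and its gradient at each iterate, and neither property is guaranteed to hold.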