DADA: Dual Averaging with Distance Adaptation

Mohammad Moshtaghifar, Anton Rodomanov, Daniil Vankov, Sebastian Stich

TL;DR

DADA is a universal gradient method built on dual averaging with distance adaptation: it tunes its coefficients automatically from the observed gradients and the distance from the initial point, removing the need for problem-specific hyperparameters. By relating progress to a local growth function $\omega$ and a distance-based bound $D_0$, the method achieves convergence guarantees across broad classes of convex functions on potentially unbounded domains. The authors derive explicit complexity bounds for several smoothness regimes, including Lipschitz, Hölder-smooth, high-order Lipschitz-derivative, quasi-self-concordant, and $(L_0, L_1)$-smooth functions, with the constants in these bounds depending on the choice of the parameter $c$. In theory, the approach compares favorably with several distance-adaptation baselines, and it shows robust empirical performance on Softmax, Hölder-smooth, and worst-case problems, which makes it a practical option for parameter-free optimization.

Abstract

We present a novel universal gradient method for solving convex optimization problems. Our algorithm -- Dual Averaging with Distance Adaptation (DADA) -- is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point, eliminating the need for problem-specific parameters. DADA is a universal algorithm that simultaneously works for a broad spectrum of problem classes, provided the local growth of the objective function around its minimizer can be bounded. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, Hölder-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and $(L_0,L_1)$-smooth functions. Crucially, DADA is applicable to both unconstrained and constrained problems, even when the domain is unbounded, without requiring prior knowledge of the number of iterations or desired accuracy.
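To make the mechanism in the abstract concrete, here is a minimal sketch of dual averaging with distance-adaptive coefficients for the unconstrained Euclidean case. The coefficient rule used here (weights $a_k = \bar{r}_k / \|g_k\|$ and regularization $\beta_{k+1} = c\sqrt{k+1}$, with $\bar{r}_k$ the largest distance from $x_0$ seen so far) is an assumption chosen for illustration; the paper's actual rule is eq:dada-stepsizes, which is not reproduced in this summary. The function name `dada_sketch` and its parameters are hypothetical.

```python
import numpy as np

def dada_sketch(grad, x0, num_iters, r_bar=1.0, c=2.0):
    """Dual averaging with distance-adaptive weights (illustrative only).

    ASSUMPTION: the weight/regularization rule below is a stand-in,
    not the coefficients from the paper.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    weighted_sum = np.zeros_like(x0)   # running sum of a_i * g_i
    max_dist = float(r_bar)            # r_bar_k: largest distance from x0 so far
    iterates = [x.copy()]
    for k in range(num_iters):
        g = np.asarray(grad(x), dtype=float)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break                      # zero gradient: x minimizes a convex objective
        max_dist = max(max_dist, float(np.linalg.norm(x - x0)))
        weighted_sum += (max_dist / g_norm) * g   # assumed weight a_k
        beta = c * np.sqrt(k + 1)                 # assumed regularization beta_{k+1}
        # Dual averaging step: minimize <weighted_sum, x> + (beta / 2) * ||x - x0||^2
        x = x0 - weighted_sum / beta
        iterates.append(x.copy())
    return iterates
```

No Lipschitz constant, step size, or iteration budget enters the update; the only inputs besides the gradient oracle are the initial distance guess `r_bar` and the constant `c`. Because the step is always taken from `x0`, the running distance `max_dist` is what lets the iterates move progressively farther from the starting point.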

Paper Structure (28 sections, 21 theorems, 112 equations, 5 figures, 1 algorithm)

Key Result

Theorem 2.1

Consider Algorithm alg:da for solving problem eq:problem using the coefficients from eq:dada-stepsizes with $c > \sqrt{2}$. Then, for any $T \geq 1$ and $v_T^* \coloneqq \min_{0 \leq k \leq T - 1} v(x_k)$, it holds that [...] and [...], where $\bar{D} \coloneqq \max\{\bar{r}, \frac{2c}{c - \sqrt{2}} D_0\}$ and $D \coloneqq \sqrt{2} (c D_0 + \frac{1}{c} \bar{D})$. Consequently, for a given $\delta > 0$, it holds that [...]
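The two constants defined in the theorem can be evaluated directly, which gives a feel for how they depend on $c$. The helper below is a small sketch that only transcribes the definitions of $\bar{D}$ and $D$ stated above; the function name `dada_constants` and its argument names are hypothetical, and the bounds in which these constants appear are not reproduced here.

```python
import math

def dada_constants(c, d0, r_bar):
    """Evaluate D_bar and D from the definitions in Theorem 2.1.

    c     : the constant from the coefficient rule, must exceed sqrt(2)
    d0    : the quantity D_0 from the theorem
    r_bar : the quantity r-bar from the theorem
    """
    if c <= math.sqrt(2):
        raise ValueError("the theorem assumes c > sqrt(2)")
    # D_bar = max{ r_bar, 2c / (c - sqrt(2)) * D_0 }
    d_bar = max(r_bar, 2 * c / (c - math.sqrt(2)) * d0)
    # D = sqrt(2) * (c * D_0 + D_bar / c)
    d = math.sqrt(2) * (c * d0 + d_bar / c)
    return d_bar, d
```

For instance, with $c = 2\sqrt{2}$, $D_0 = 1$, and a small $\bar{r}$, the factor $\frac{2c}{c - \sqrt{2}}$ equals 4, so $\bar{D} = 4$ and $D = \sqrt{2}\,(2\sqrt{2} + \sqrt{2}) = 6$. As $c$ approaches $\sqrt{2}$ from above, the same factor, and with it both constants, grows without bound.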

Figures (5)

  • Figure 4.1: Comparison of different methods on the Softmax function.
  • Figure 4.2: The ratio $\frac{D}{\bar{r}_t}$ for the Softmax function with different optimal points $x^*$.
  • Figure 4.3: Comparison of different methods on the polyhedron feasibility problem.
  • Figure 4.4: Comparison of different methods on the worst-case function.
  • Figure 4.5: The ratio $\frac{D}{\bar{r}_t}$ for the worst-case function with different optimal points $x^*$.

Theorems & Definitions (34)

  • Theorem 2.1
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • Lemma B.1
  • proof
  • Lemma B.2
  • ...and 24 more