DADA: Dual Averaging with Distance Adaptation
Mohammad Moshtaghifar, Anton Rodomanov, Daniil Vankov, Sebastian Stich
TL;DR
DADA is a universal gradient method built on dual averaging with distance adaptation. It tunes its coefficients automatically from the observed gradients and the distance of the iterates from the initial point, removing the need for problem-specific hyperparameters. By relating progress to a local growth function omega of the objective around its minimizer and to a distance-based bound D0, the method obtains convergence guarantees across broad classes of convex functions, including on unbounded domains. The authors derive explicit complexity bounds for several regimes: nonsmooth Lipschitz, Lipschitz-smooth, Hölder-smooth, high-order Lipschitz derivative, quasi-self-concordant, and (L0, L1)-smooth functions, with a suitable choice of the constant c yielding favorable constant factors in these bounds. In theory, the approach compares favorably with several distance-adaptation baselines, and it shows robust empirical performance on Softmax, Hölder-smooth, and worst-case test problems, which makes it relevant for parameter-free optimization in diverse settings.
Abstract
We present a novel universal gradient method for solving convex optimization problems. Our algorithm -- Dual Averaging with Distance Adaptation (DADA) -- is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point, eliminating the need for problem-specific parameters. DADA is a universal algorithm that simultaneously works for a broad spectrum of problem classes, provided the local growth of the objective function around its minimizer can be bounded. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, Hölder-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and $(L_0,L_1)$-smooth functions. Crucially, DADA is applicable to both unconstrained and constrained problems, even when the domain is unbounded, without requiring prior knowledge of the number of iterations or desired accuracy.
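To make the idea of dual averaging with distance-adapted coefficients concrete, here is a minimal sketch for the unconstrained Euclidean case. It is not the authors' algorithm: the coefficient rules (`a_k = r_bar / ||g_k||` and `beta = c * sqrt(k + 2)`), the initial distance guess `r0`, the default `c = 1.0`, and all function names are assumptions chosen for illustration. Only the overall structure is taken from the abstract, namely a dual averaging step whose coefficients depend on the observed gradients and on the distance of the iterates from the starting point.

```python
import numpy as np


def distance_adaptive_dual_averaging(grad, x0, num_iters, r0=1e-3, c=1.0):
    """Illustrative dual averaging loop (unconstrained, Euclidean norm).

    ASSUMPTION: the rules for a_k and beta below are placeholders for
    illustration only; they are not the coefficient rules from the paper.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    weighted_grad_sum = np.zeros_like(x0)  # running sum of a_i * g_i
    r_bar = r0                             # largest distance from x0 seen so far
    iterates = [x.copy()]

    for k in range(num_iters):
        g = np.asarray(grad(x), dtype=float)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break  # zero (sub)gradient: x is already a minimizer

        # Distance adaptation: track the maximal distance from the start.
        r_bar = max(r_bar, float(np.linalg.norm(x - x0)))

        # Hypothetical coefficient choices (assumptions, see lead-in).
        a_k = r_bar / g_norm
        weighted_grad_sum += a_k * g
        beta = c * np.sqrt(k + 2.0)

        # Closed-form minimizer of <sum_i a_i g_i, x> + (beta / 2) * ||x - x0||^2.
        x = x0 - weighted_grad_sum / beta
        iterates.append(x.copy())

    return iterates


# Toy nonsmooth Lipschitz objective: f(x) = ||x - target||_2.
target = np.array([3.0, 4.0])


def f(x):
    return float(np.linalg.norm(x - target))


def grad_f(x):
    d = x - target
    n = np.linalg.norm(d)
    return d / n if n > 0 else np.zeros_like(d)


iterates = distance_adaptive_dual_averaging(grad_f, np.zeros(2), num_iters=200)
best_value = min(f(x) for x in iterates)
print(best_value)
```

The sketch starts from a deliberately small distance guess `r0` and lets the running maximum `r_bar` grow as the iterates move away from `x0`, so no step size or distance to the minimizer has to be supplied by hand. Handling a constrained domain would require replacing the closed-form step with a minimization over that domain, which is omitted here.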
