Learning complexity of gradient descent and conjugate gradient algorithms
Xianqi Jiao, Jia Liu, Zhiping Chen
TL;DR
The paper introduces a data-driven framework that quantifies the learning complexity of tuning gradient-based optimization algorithms (GD and CG) by treating tuning as algorithm selection. A key contribution is a cost measure inspired by the primal-dual integral: it remains meaningful when iterations are halted early and yields bounded pseudo-dimensions, which enables ERM-based guarantees. Under the new cost, GD is shown to be (C+ε,δ)-learnable with sample complexity m = Õ(H^3/ε^2), and the approach extends to CG with m = Õ(H^4/ε^2). Together, these results show that optimal step-size and conjugate-parameter configurations can be identified with high probability from a finite sample, advancing data-driven algorithm design for large-scale optimization problems.
Abstract
Gradient Descent (GD) and Conjugate Gradient (CG) methods are among the most effective iterative algorithms for solving unconstrained optimization problems, particularly in machine learning and statistical modeling, where they are employed to minimize cost functions. In these algorithms, tunable parameters, such as step sizes or conjugate parameters, play a crucial role in determining key performance metrics such as runtime and solution quality. In this work, we introduce a framework that models algorithm selection as a statistical learning problem, so that the learning complexity can be estimated through the pseudo-dimension of the algorithm group. We first propose a new cost measure for unconstrained optimization algorithms, inspired by the concept of the primal-dual integral in mixed-integer linear programming. Based on the new cost measure, we derive an improved upper bound for the pseudo-dimension of the gradient descent algorithm group by discretizing the set of step-size configurations. Moreover, we generalize our findings from the gradient descent algorithm group to the conjugate gradient algorithm group for the first time, and prove the existence of a learning algorithm capable of probabilistically identifying the optimal algorithm given a sufficiently large sample size.
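To make the algorithm-selection view concrete, the following is a minimal sketch of empirical risk minimization (ERM) over a discretized set of step sizes. It is an illustration under stated assumptions, not the paper's construction: the problem instances are hypothetical one-dimensional quadratics f(x) = 0.5·a·x², and the cost is a simple stand-in (the average normalized optimality gap over H iterations, capped at 1) rather than the paper's primal-dual-integral-inspired measure. The names gd_cost and erm_step_size are introduced here for illustration only.

```python
import random


def gd_cost(a, x0, rho, H):
    """Run H gradient descent steps on f(x) = 0.5*a*x^2 with step size rho.

    Returns the average optimality gap f(x_t) - f*, normalized by f(x_0)
    and capped at 1 so that diverging runs still have a bounded cost.
    This is an assumed stand-in cost, not the paper's measure.
    """
    f0 = 0.5 * a * x0 * x0  # initial gap; the minimum value f* is 0
    x, total = x0, 0.0
    for _ in range(H):
        x = x - rho * a * x  # gradient of f at x is a*x
        gap = 0.5 * a * x * x / f0
        total += min(gap, 1.0)
    return total / H


def erm_step_size(samples, grid, H):
    """Pick the grid step size with the lowest average cost on the samples."""
    def avg_cost(rho):
        return sum(gd_cost(a, x0, rho, H) for a, x0 in samples) / len(samples)
    return min(grid, key=avg_cost)


# Hypothetical instance distribution: curvature a in [1, 2], start x0 in [1, 3].
rng = random.Random(0)
samples = [(rng.uniform(1.0, 2.0), rng.uniform(1.0, 3.0)) for _ in range(50)]
grid = [0.05 * k for k in range(1, 40)]  # discretized step sizes in (0, 2)
best_rho = erm_step_size(samples, grid, H=20)
print("ERM step size:", best_rho)
```

The sample-complexity results stated in the abstract concern how many sampled instances are needed for a selection of this kind to generalize to the underlying distribution; the sketch only shows the selection step itself.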
