Estimating Sequences with Memory for Minimizing Convex Non-smooth Composite Functions

Endrit Dosti, Sergiy A. Vorobyov, Themistoklis Charalambous

TL;DR

This work develops a memory-augmented generalization of estimating sequences for convex non-smooth composite optimization, enabling acceleration in black-box settings $F(\boldsymbol x)=f(\boldsymbol x)+\tau g(\boldsymbol x)$ with a non-smooth term. The proposed method uses generalized composite estimating sequences that incorporate a memory term $\psi_k$ and a reduced gradient, together with a backtracking line-search that makes the algorithm robust to unknown Lipschitz constants and imperfect strong convexity knowledge. Theoretical results establish an accelerated convergence rate and robustness guarantees, while numerical experiments on quadratic and logistic losses (including real LIBSVM datasets) show improved performance and monotonicity over benchmarks like AMGS and FISTA. The approach is practical for large-scale data processing and can be extended to stochastic, higher-order, or nonconvex settings. Overall, the memory-based framework broadens the applicability and reliability of first-order acceleration in composite convex optimization.

Abstract

First-order optimization methods are crucial for solving large-scale data processing problems, particularly those involving convex non-smooth composite objectives. For such problems, we introduce a new class of generalized composite estimating sequences, devised by exploiting the information embedded in the iterates generated during the minimization process. Building on these sequences, we propose a novel accelerated first-order method tailored to this objective structure. The method features a backtracking line-search strategy and achieves an accelerated convergence rate regardless of whether the true Lipschitz constant is known. It is also robust to imperfect knowledge of the strong convexity parameter, a property of significant practical importance. The method's efficiency and robustness are substantiated by numerical evaluations on both synthetic and real-world datasets, demonstrating its effectiveness in data processing applications.
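To make the composite structure and the role of backtracking concrete, the following is a minimal sketch of a single proximal gradient step with line search on the Lipschitz estimate, using an elastic net regularizer as in the experiments. It is not the accelerated method proposed in the paper: it omits the estimating sequences and the memory term entirely, and shows only a plain, non-accelerated step. The function names, the growth factor `eta`, and the small numerical tolerance are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_elastic_net(v, step, tau1, tau2):
    """Prox of step * (tau1 * ||x||_1 + (tau2 / 2) * ||x||_2^2)."""
    return soft_threshold(v, step * tau1) / (1.0 + step * tau2)

def backtracking_prox_step(x, f, grad_f, L, tau1, tau2, eta=2.0):
    """One proximal gradient step on f(x) + elastic net (illustrative only).

    L is an estimate of the Lipschitz constant of grad_f. It is multiplied
    by eta (an assumed growth factor) until the quadratic upper bound on f
    holds at the trial point. Returns the new point, the reduced gradient
    L * (x - x_new), and the accepted value of L.
    """
    fx, gx = f(x), grad_f(x)
    while True:
        x_new = prox_elastic_net(x - gx / L, 1.0 / L, tau1, tau2)
        d = x_new - x
        # Sufficient-decrease test; 1e-12 is an assumed rounding tolerance.
        if f(x_new) <= fx + gx @ d + 0.5 * L * (d @ d) + 1e-12:
            return x_new, L * (x - x_new), L
        L *= eta
```

Starting from an underestimate such as `L = 0.1`, repeated calls raise `L` only until the upper bound is satisfied, which is the sense in which a line search removes the need to know the true Lipschitz constant in advance. The returned reduced gradient vanishes exactly at a minimizer of the composite objective, so its norm can serve as a stopping criterion.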

Paper Structure

This paper contains 8 sections, 8 theorems, 81 equations, 3 figures, 1 algorithm.

Key Result

Theorem 1

Let $F({\boldsymbol x})$ be a composition of an $L_{\hat{f}}$-smooth and $\mu_{\hat{f}}$-strongly convex function $\hat{f}({\boldsymbol x})$, and a simple convex function $\hat{g}({\boldsymbol x})$, as given in FFF. For $L \geq L_{\hat{f}}$, and ${\boldsymbol x}, \, {\boldsymbol y} \in \mathcal{R}^n$ ...

Figures (3)

  • Figure 1: Performance evaluation of our proposed method and the selected benchmarks on synthetic data. We consider a quadratic objective function and an elastic net regularizer. (a) Evaluating the distance to ${\boldsymbol x}^*$, $m = 500$, $\kappa = 10^3$ and $\tau_1 = \tau_2 = 10^{-3}$. Note that the curves for Proposed 1 and Proposed 2 almost fully overlap. (b) Convergence of the terms $\{\gamma_k\}_k$, $m = 500$, $\kappa = 10^3$ and $\tau_1 = \tau_2 = 10^{-3}$. Note that the curves for Proposed 2 and Proposed 3 almost fully overlap $\forall k$, and then also fully overlap with the curve for Proposed 1 for $k$ larger than 180. (c) Evaluating the distance to ${\boldsymbol x}^*$, $m = 1000$, $\kappa = 10^{7}$ and $\tau_1 = \tau_2 = 10^{-7}$. Note that the curves for Proposed 1 and Proposed 2 fully overlap, that is, Proposed 1 and Proposed 2 have identical performance. (d) Convergence of the terms $\{\gamma_k\}_k$, $m = 1000$, $\kappa = 10^7$ and $\tau_1 = \tau_2 = 10^{-7}$. Note that the curves for Proposed 2 and Proposed 3 fully overlap $\forall k$, and then also fully overlap with the curve for Proposed 1 for $k$ larger than about 8000.
  • Figure 2: Performance evaluation of our proposed method and the selected benchmarks on the "a1a" dataset. We consider a quadratic objective function and an elastic net regularizer, and assume that the true value of $L_{\hat{f}}$ is not known. (a) Evaluating the distance to ${\boldsymbol x}^*$ for the "a1a" dataset, $L_0 = 0.1 L_{\text{"a1a"}}$ and $\tau_1 = \tau_2 = 10^{-4}$. (b) Evaluating the distance to ${\boldsymbol x}^*$ for the "a1a" dataset, $L_0 = 0.1 L_{\text{"a1a"}}$ and $\tau_1 = \tau_2 = 10^{-5}$. Note that the curves for Proposed 2 and Proposed 3 almost fully overlap. (c) Evaluating the distance to ${\boldsymbol x}^*$ for the "a1a" dataset, $L_0 = 10 L_{\text{"a1a"}}$ and $\tau_1 = \tau_2 = 10^{-4}$. (d) Evaluating the distance to ${\boldsymbol x}^*$ for the "a1a" dataset, $L_0 = 10 L_{\text{"a1a"}}$ and $\tau_1 = \tau_2 = 10^{-5}$.
  • Figure 3: Performance evaluation of our proposed method and the selected benchmarks on real data. We consider the logistic objective function and an elastic net regularizer. (a) Evaluating the distance to ${\boldsymbol x}^*$ for the "rcv1.binary" dataset, $\tau_1 = \tau_2 = 10^{-4}$. (b) Evaluating the distance to ${\boldsymbol x}^*$ for the "rcv1.binary" dataset, $\tau_1 = \tau_2 = 10^{-5}$. (c) Evaluating the distance to ${\boldsymbol x}^*$ for the "triazine" dataset, $\tau_1 = \tau_2 = 10^{-6}$. (d) Evaluating the distance to ${\boldsymbol x}^*$ for the "triazine" dataset, $\tau_1 = \tau_2 = 10^{-7}$.

Theorems & Definitions (15)

  • Theorem 1
  • Theorem 2
  • Definition 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 3
  • Lemma 4
  • Theorem 4
  • proof
  • ...and 5 more