
FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization

Zhanhong Jiang, Md Zahid Hasan, Aditya Balu, Joshua R. Waite, Genyi Huang, Soumik Sarkar

TL;DR

The paper addresses a trade-off in stochastic optimization for machine learning: relying solely on first-order methods is inefficient, while second-order methods are costly. It proposes FUSE, a unified framework that combines Adam-like first-order steps with L-BFGS-like second-order updates, and a practical variant, FUSE-PV, that switches between the two. The theoretical analysis gives complexity results under strongly convex and non-convex conditions, and experiments on simple non-convex test functions and diverse datasets report improved convergence and training efficiency. The approach is aimed at faster, more robust optimization under limited compute in deep learning and related models.

Abstract

Stochastic optimization methods play a critical role in the performance of modern machine learning algorithms. Although many approaches have been proposed, first-order and second-order methods are in very different positions. The former dominate deep learning but only guarantee convergence to a stationary point, while the latter are less popular because of their computational cost in high-dimensional problems. This paper presents a novel method that leverages both first-order and second-order methods in a unified algorithmic framework, termed FUSE, from which a practical version (PV) is derived. FUSE-PV is a simple yet efficient optimization method that switches between first-order and second-order updates. We also develop several criteria that determine when to switch. FUSE-PV is shown to have a smaller computational complexity than SGD and Adam. To validate the proposed scheme, we present an ablation study on several simple test functions and a comparison with baselines on benchmark datasets.
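As a rough illustration of the switch-over idea described in the abstract, the sketch below runs Adam for a fixed number of steps and then hands over to L-BFGS (two-loop recursion). It is not the FUSE-PV algorithm from the paper. The switch rule (a fixed iteration budget `adam_steps`), the Armijo backtracking line search, the deterministic gradient, the quadratic test objective, and all function names are assumptions made for this example.

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; returns the new iterate and moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def lbfgs_direction(g, pairs):
    """Two-loop recursion: approximates -H^{-1} g from stored (s, y) pairs."""
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):                        # newest pair first
        alpha = s.dot(q) / y.dot(s)
        alphas.append(alpha)
        q -= alpha * y
    if pairs:                                           # common initial scaling
        s, y = pairs[-1]
        q *= s.dot(y) / y.dot(y)
    for (s, y), alpha in zip(pairs, reversed(alphas)):  # oldest pair first
        beta = y.dot(q) / y.dot(s)
        q += (alpha - beta) * s
    return -q

def minimize_with_switch(f, grad, x0, adam_steps=100, max_iter=1000,
                         memory=10, tol=1e-8):
    """Hypothetical switch-over loop: Adam first, then L-BFGS."""
    x = np.asarray(x0, dtype=float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    pairs, switched = [], False
    for t in range(1, max_iter + 1):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        if t <= adam_steps:                             # first-order phase
            x, m, v = adam_step(x, g, m, v, t)
            continue
        switched = True                                 # second-order phase
        d = lbfgs_direction(g, pairs)
        fx, slope, step = f(x), g.dot(d), 1.0
        # Armijo backtracking (an assumption of this sketch)
        while not f(x + step * d) <= fx + 1e-4 * step * slope and step > 1e-12:
            step *= 0.5
        x_new = x + step * d
        s, y = x_new - x, grad(x_new) - g
        # keep the pair only if the curvature condition holds
        if s.dot(y) > 1e-10 * np.linalg.norm(s) * np.linalg.norm(y):
            pairs = (pairs + [(s, y)])[-memory:]
        x = x_new
    return x, switched

# Ill-conditioned, strongly convex quadratic used only as a toy objective.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_final, switched = minimize_with_switch(f, grad, x0=[3.0, -2.0])
```

A fixed iteration budget is the simplest possible switch rule and is used here only to keep the example deterministic.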

Paper Structure

This paper contains 8 sections, 2 theorems, 4 equations, 7 figures, 2 tables, 4 algorithms.

Key Result

Theorem 1

(Informal) Suppose that Assumption assump_1 holds and that $f^i$ is strongly convex with constant $\mu>0$. There exist constants $\zeta>0$ and $\epsilon>0$ such that $\zeta>\epsilon$. Then FUSE-PV incurs a complexity of order $\mathcal{O}(\textnormal{max}\{\frac{1}{\zeta},\textnormal{log}

Figures (7)

  • Figure 1: Optimizer performance for 2D Rosenbrock function.
  • Figure 2: Optimizer performance for 2D Rastrigin function.
  • Figure 3: Optimizer performance for 2D Ackley function.
  • Figure 4: Optimizer performance for 2D Himmelblau function.
  • Figure 5: Training loss for different criteria (DenseNet on CIFAR-10).
  • ...and 2 more figures
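
For reference, the four test functions named in Figures 1 to 4 have standard textbook 2D forms. The sketch below uses the common parameter choices (for example `a=1`, `b=100` for Rosenbrock). These are assumptions, since this summary does not state the exact variants, scalings, or domains used in the paper.

```python
import numpy as np

def rosenbrock(p, a=1.0, b=100.0):
    """Narrow curved valley; global minimum 0 at (a, a**2)."""
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rastrigin(p, A=10.0):
    """Highly multimodal; global minimum 0 at the origin."""
    p = np.asarray(p, dtype=float)
    return A * p.size + np.sum(p ** 2 - A * np.cos(2 * np.pi * p))

def ackley(p):
    """Nearly flat outer region with a deep central well; global minimum 0 at the origin."""
    x, y = p
    return (-20.0 * np.exp(-0.2 * np.sqrt(0.5 * (x ** 2 + y ** 2)))
            - np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
            + np.e + 20.0)

def himmelblau(p):
    """Four global minima with value 0, one of them at (3, 2)."""
    x, y = p
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2
```

Each function takes a length-2 point, so any of them can be passed directly as the objective to an optimizer under test.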

Theorems & Definitions (4)

  • Theorem 1
  • Remark 1
  • Theorem 2
  • Remark 2