Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization

Zhe Li, Bicheng Ying, Zidong Liu, Chaosheng Dong, Haibo Yang

TL;DR

This work introduces HiSo, a Hessian-informed zeroth-order federated optimization method that preserves strict scalar-only communication while leveraging a global diagonal Hessian approximation to accelerate convergence. The authors develop a generalized scalar-only FL framework, derive a Hessian-informed ascent step, and learn the diagonal curvature with Adam-like updates without increasing communication. Under a low-effective-rank Hessian assumption, HiSo achieves convergence rates that are independent of the model dimension $d$ and the Lipschitz constant $L$. It outperforms prior ZO-FL baselines both in its theoretical guarantees and in empirical LLM fine-tuning tasks, with substantial reductions in communication rounds and total data exchanged. The results indicate that incorporating curvature information through diagonal preconditioning can substantially improve ZO-FL efficiency, making it practical for large-scale federated fine-tuning.

Abstract

Zeroth-order (ZO) optimization enables dimension-free communication in federated learning (FL), making it attractive for fine-tuning of large language models (LLMs) due to significant communication savings. However, existing ZO-FL methods largely overlook curvature information, despite its well-established benefits for convergence acceleration. To address this, we propose HiSo, a Hessian-informed ZO federated optimization method that accelerates convergence by leveraging global diagonal Hessian approximations, while strictly preserving scalar-only communication without transmitting any second-order information. Theoretically, for non-convex functions, we show that HiSo can achieve an accelerated convergence rate that is independent of the Lipschitz constant $L$ and model dimension $d$ under some Hessian approximation assumptions, offering a plausible explanation for the observed phenomenon of ZO convergence being much faster than its worst-case $\mathscr{O}(d)$-bound. Empirically, across diverse LLM fine-tuning benchmarks, HiSo delivers a 1$\sim$5$\times$ speedup in communication rounds over existing state-of-the-art ZO-FL baselines. This superior convergence not only cuts communication costs but also provides strong empirical evidence that Hessian information acts as an effective accelerator in federated ZO optimization settings. Our source code is provided at https://github.com/ZidongLiu/DeComFL.
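
The scalar-only idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single local step per round, a shared random seed from which every node regenerates the same Gaussian direction, a central finite difference as the zeroth-order estimate, and a diagonal curvature estimate `h_diag` that is simply given (how HiSo learns it is not shown here). All function names, the quadratic test losses, and the hyperparameters are hypothetical.

```python
import numpy as np


def shared_direction(seed, dim):
    """Regenerate the same Gaussian direction on every node from a shared seed."""
    return np.random.default_rng(seed).standard_normal(dim)


def client_scalar(loss_fn, x, u, mu=1e-3):
    """Central finite difference of the local loss along u.

    This single float is the only value a client uploads.
    """
    return float((loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2.0 * mu))


def federated_round(client_losses, x, h_diag, seed, lr=0.02, mu=1e-3):
    """One round with a single local step per client (assumed simplification)."""
    z = shared_direction(seed, x.size)
    u = z / np.sqrt(h_diag)  # diagonally preconditioned search direction
    scalars = [client_scalar(f, x, u, mu) for f in client_losses]  # uplink: one float each
    g = float(np.mean(scalars))  # downlink: one aggregated float
    # Any node holding (seed, g, h_diag) can rebuild this exact step locally.
    return x - lr * g * u, g


def make_quadratic(curv, center):
    """Hypothetical client loss: 0.5 * sum(curv * (x - center)^2)."""
    return lambda x: 0.5 * float(np.sum(curv * (x - center) ** 2))


def demo(rounds=200, dim=10):
    """Run a toy problem with two heterogeneous clients; return (start, end) global loss."""
    curv = np.linspace(1.0, 10.0, dim)
    centers = [np.full(dim, -1.0), np.full(dim, 3.0)]
    losses = [make_quadratic(curv, c) for c in centers]

    def global_loss(x):
        return float(np.mean([f(x) for f in losses]))

    x = np.zeros(dim)
    start = global_loss(x)
    for r in range(rounds):
        # The round index doubles as the shared seed in this sketch.
        x, _ = federated_round(losses, x, h_diag=curv, seed=r)
    return start, global_loss(x)
```

In this toy setting the model vector never leaves a node: each round moves one float per client upstream and one float downstream, while the direction and the preconditioner are reconstructed locally. Passing the true curvature as `h_diag` is an idealization chosen so the effect of preconditioning is easy to see; with `h_diag` set to all ones the same code reduces to a plain, unpreconditioned ZO step.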

Paper Structure

This paper contains 46 sections, 10 theorems, 93 equations, 10 figures, 10 tables, 2 algorithms.

Key Result

Theorem 1

Under Assumptions assump.l.hessian, assumption.stochastic gradients, assumption.bounded-hetero, and assumption.bounded-learned-hessian, if $\eta \leq \min\left(\frac{\beta_\ell}{mL}, \frac{1}{8\rho_k}, \frac{\beta_\ell}{4(\tau-1)}\sqrt{\frac{1}{L(d+2)}}\right)$ and we denote $\Delta_{1,*} := F(\bar{x}_{\dots}$ ..., where $\bar{x}_{r,k} = \frac{1}{M}\sum_{i=1}^M x_{r,k}^{(i)}$, $\bar{\rho} = \frac{1}{\tau R}\sum \dots$
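
As a quick aid for reading the step-size condition, the three-term bound on $\eta$ can be evaluated numerically. The sketch below only transcribes the $\min(\cdot)$ expression from the theorem statement. The argument names mirror the symbols $\beta_\ell$, $m$, $L$, $\rho_k$, $\tau$, $d$, whose definitions are not given in this excerpt, and dropping the third term when $\tau = 1$ is an assumption about how to read the vanishing $(\tau - 1)$ denominator.

```python
import math


def step_size_bound(beta_l, m, L, rho_k, tau, d):
    """Evaluate min(beta_l/(m*L), 1/(8*rho_k), beta_l/(4*(tau-1)) * sqrt(1/(L*(d+2))))."""
    terms = [beta_l / (m * L), 1.0 / (8.0 * rho_k)]
    if tau > 1:
        # Assumption: for tau == 1 the (tau - 1) denominator vanishes,
        # so the third term imposes no constraint and is skipped.
        terms.append(beta_l / (4.0 * (tau - 1)) * math.sqrt(1.0 / (L * (d + 2))))
    return min(terms)
```

Only the third term depends on $d$, and it is active only when $\tau > 1$, so in this transcription the explicit dimension dependence of the step size comes from multiple local updates.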

Figures (10)

  • Figure 1: An illustration of ZO update.
  • Figure 2: One-round update with 2 clients and 3 local updates. The clients share the same direction for each local update but use different step lengths. Arriving at $x_{r+1}$ takes 7 steps for both clients: 3 local updates, a reset, and 3 updates with global values.
  • Figure 3: An illustration of HiSo.
  • Figure 4: An illustration of the eigenvalue distribution.
  • Figure 5: Ablation study of smoothing parameter $\nu$ and the distribution of the learned global Hessian $H$.
  • ...and 5 more figures

Theorems & Definitions (13)

  • Theorem 1
  • Corollary 1: Convergence Rate for HiSo
  • Corollary 2: Convergence Rate for DeComFL
  • Corollary 3: Convergence Rate for $\tau>1$ case
  • Lemma 1: Fourth-Order Moment of Gaussian Vector
  • proof
  • Lemma 2: Gaussian Smoothed Function
  • proof
  • Lemma 3
  • proof
  • ...and 3 more