FedPM: Federated Learning Using Second-order Optimization with Preconditioned Mixing of Local Parameters

Hiro Ishii, Kenta Niwa, Hiroshi Sawada, Akinori Fujino, Noboru Harada, Rio Yokota

TL;DR

FedPM tackles drift in local second-order preconditioners in federated learning by replacing simple server mixing with preconditioned mixing of local parameters, aligning updates with the globally preconditioned curvature. The method decomposes the ideal global second-order update into per-client local updates and server-side mixing using a shared preconditioner, enabling global second-order optimization even with multiple local updates. A convergence analysis shows a superlinear rate for strongly convex objectives with a single local update, and FedPM employs FOOF-based preconditioner approximations to scale to deep networks. Empirically, FedPM outperforms first-order (FO) and second-order (SO) baselines on both strongly convex and non-convex tasks, particularly under data heterogeneity, supporting its practical value for robust, fast FL training.
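
The contrast between simple and preconditioned mixing can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes each client holds a quadratic objective with diagonal curvature (so no linear-algebra library is needed), and all names (`local_newton_step`, `simple_mixing`, `preconditioned_mixing`, the curvature and optimum values) are hypothetical. Each client takes one local Newton step; the server then either averages the results or weights them by the local curvatures.

```python
# Hypothetical toy setup (not from the paper): client i holds
#   f_i(theta) = 0.5 * sum_d h_i[d] * (theta[d] - c_i[d])**2
# with diagonal curvature h_i and local optimum c_i.

def local_newton_step(theta, h, c):
    """One local second-order step: theta - H_i^{-1} * grad f_i(theta)."""
    grad = [hd * (td - cd) for hd, td, cd in zip(h, theta, c)]
    return [td - gd / hd for td, gd, hd in zip(theta, grad, h)]

def simple_mixing(local_thetas):
    """Plain average of local parameters."""
    n = len(local_thetas)
    return [sum(col) / n for col in zip(*local_thetas)]

def preconditioned_mixing(local_thetas, local_hs):
    """Curvature-weighted average: (sum_i H_i)^{-1} * sum_i H_i * theta_i."""
    mixed = []
    for d in range(len(local_thetas[0])):
        num = sum(h[d] * th[d] for th, h in zip(local_thetas, local_hs))
        den = sum(h[d] for h in local_hs)
        mixed.append(num / den)
    return mixed

def global_grad(theta, hs, cs):
    """Gradient of the summed objective sum_i f_i at theta."""
    return [sum(h[d] * (theta[d] - c[d]) for h, c in zip(hs, cs))
            for d in range(len(theta))]

# Two heterogeneous clients: different curvatures and different optima.
curvatures = [[1.0, 10.0], [10.0, 1.0]]
optima = [[0.0, 0.0], [2.0, 2.0]]
theta0 = [5.0, -3.0]

local_thetas = [local_newton_step(theta0, h, c)
                for h, c in zip(curvatures, optima)]
sm = simple_mixing(local_thetas)
pm = preconditioned_mixing(local_thetas, curvatures)

print("simple mixing:        ", sm, global_grad(sm, curvatures, optima))
print("preconditioned mixing:", pm, global_grad(pm, curvatures, optima))
```

In this quadratic setting the curvature-weighted mix lands on the stationary point of the summed objective after a single round, while the plain average does not. Real models are not quadratic, so this only illustrates why ignoring heterogeneous local preconditioners at the server can bias the mixed parameters.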

Abstract

We propose Federated Preconditioned Mixing (FedPM), a novel Federated Learning (FL) method that leverages second-order optimization. Prior methods--such as LocalNewton, LTDA, and FedSophia--have incorporated second-order optimization in FL by performing iterative local updates on clients and applying simple mixing of local parameters on the server. However, these methods often suffer from drift in local preconditioners, which significantly disrupts the convergence of parameter training, particularly in heterogeneous data settings. To overcome this issue, we refine the update rules by decomposing the ideal second-order update--computed using globally preconditioned global gradients--into parameter mixing on the server and local parameter updates on clients. As a result, our FedPM introduces preconditioned mixing of local parameters on the server, effectively mitigating drift in local preconditioners. We provide a theoretical convergence analysis demonstrating a superlinear rate for strongly convex objectives in scenarios involving a single local update. To demonstrate the practical benefits of FedPM, we conducted extensive experiments. The results showed significant improvements with FedPM in the test accuracy compared to conventional methods incorporating simple mixing, fully leveraging the potential of second-order optimization.
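
The decomposition described in the abstract can be written out for a single local update. The notation below is assumed for illustration and may differ from the paper's: $N$ clients, local gradient $g_i = \nabla f_i(\theta)$, and local preconditioner $H_i$ (for example the local Hessian $\nabla^2 f_i(\theta)$ or an approximation of it).

```latex
% Assumed notation (not taken from the source): N clients,
% g_i = \nabla f_i(\theta), H_i = local preconditioner at \theta.
\begin{align}
\theta^{+} &= \theta - \Big(\sum_{i=1}^{N} H_i\Big)^{-1} \sum_{i=1}^{N} g_i
  && \text{(ideal global second-order update)} \\
\theta_i &= \theta - H_i^{-1} g_i
  \;\Longrightarrow\; g_i = H_i(\theta - \theta_i)
  && \text{(local update, } K=1\text{)} \\
\theta^{+} &= \Big(\sum_{i=1}^{N} H_i\Big)^{-1} \sum_{i=1}^{N} H_i \theta_i
  && \text{(preconditioned mixing)} \\
\frac{1}{N}\sum_{i=1}^{N} \theta_i &= \theta - \frac{1}{N}\sum_{i=1}^{N} H_i^{-1} g_i
  && \text{(simple mixing)}
\end{align}
```

Substituting the second line into the first gives the third, so under these assumptions preconditioned mixing of the local parameters reproduces the ideal global update. Simple mixing coincides with it only when all $H_i$ are equal, which is one way to read the "drift in local preconditioners" under heterogeneous data.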

Paper Structure

This paper contains 30 sections, 1 theorem, 31 equations, 7 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

Under the assumptions of strong convexity and Hessian smoothness, and given an initial-parameter condition that implicitly assumes the starting point is sufficiently close to the optimal solution, the FedPM algorithm with a single local update ($K=1$), as defined in (eq:fedpm_single), achieves a superlinear convergence rate.
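
To make the statement concrete, here is a small numerical sketch, not the paper's algorithm or experiment. It assumes a hypothetical one-dimensional problem in which each client holds an L2-regularized logistic loss (strongly convex with a smooth second derivative), takes one local Newton step per round ($K=1$), and the server mixes the local parameters weighted by the local second derivatives. The constants `LAMBDA` and `CLIENT_S` and all function names are invented for illustration.

```python
import math

# Hypothetical 1-D problem: client i holds
#   f_i(theta) = 0.5 * LAMBDA * theta**2 + log(1 + exp(-s_i * theta)).
LAMBDA = 1.0
CLIENT_S = [1.0, -0.5, 2.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess(theta, s):
    """First and second derivative of one client's objective."""
    p = sigmoid(-s * theta)
    g = LAMBDA * theta - s * p
    h = LAMBDA + s * s * p * (1.0 - p)
    return g, h

def mixed_round(theta):
    """One round: local Newton step per client, then curvature-weighted mixing."""
    num = 0.0
    den = 0.0
    for s in CLIENT_S:
        g, h = grad_hess(theta, s)
        theta_i = theta - g / h      # single local update (K = 1)
        num += h * theta_i           # server weights theta_i by h_i
        den += h
    return num / den

def global_grad(theta):
    """Derivative of the summed objective."""
    return sum(grad_hess(theta, s)[0] for s in CLIENT_S)

theta = 0.0
residuals = [abs(global_grad(theta))]
for _ in range(40):
    theta = mixed_round(theta)
    residuals.append(abs(global_grad(theta)))

# Inspect the first few rounds: the residual should shrink faster than
# by a constant factor per round if convergence is superlinear.
for r, res in enumerate(residuals[:6]):
    print(f"round {r}: |grad| = {res:.3e}")
```

In this scalar setting the mixed update equals a Newton step on the summed objective, so the printed residuals give a feel for what a superlinear rate looks like. It says nothing about the constants or conditions in the theorem itself.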

Figures (7)

  • Figure 1: Convergence curves for Test 1 ($K=1$) using the global parameter on w8a (top) and a9a (bottom). The left column displays the difference between the function value at each round and at the optimal solution. The right column depicts the L2 norm of the difference between the parameter at each round and the optimal solution.
  • Figure 2: Convergence curves for Test 2 using test accuracy for the global parameter on CIFAR10 classification with a heterogeneity level of $\alpha=0.1$ and 5 local epochs. (a) depicts test accuracy against communication rounds, whereas (b) shows test accuracy against runtime including communication overhead. The computational resources used are summarized in the appendix (secap:settings). The shaded area depicts one standard deviation of results across three different random seeds.
  • Figure 3: Relationship between the number of local parameter updates and test accuracy for CIFAR10 classification with $\alpha=0.1$. The plot shows the average highest test accuracy across three seeds. Experiments are conducted with a fixed total of $500$ local epochs (inner epochs × communication rounds), achieved by using 1 inner epoch (500 rounds), 5 inner epochs (100 rounds), and 10 inner epochs (50 rounds).
  • Figure 4: Data distributions of CIFAR10 and CIFAR100 under $\alpha$ = 1.0 and $\alpha$ = 0.1 for a single random seed. Different colors represent different classes.
  • Figure 5: Convergence curves for Test 2 using the global model on CIFAR100 with a heterogeneity level of $\alpha=0.1$ and 5 local epochs. (a) depicts test accuracy against communication rounds, while (b) shows test accuracy against runtime. The shaded areas represent one standard deviation across three random seeds. These results correspond to the settings for CIFAR100 with $\alpha=0.1$ reported in Table 3.
  • ...and 2 more figures
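
The captions above refer to a heterogeneity level $\alpha$ without saying how the data are split. A common convention in federated learning, assumed here and not confirmed by this page, is to draw each class's client proportions from a Dirichlet distribution with concentration $\alpha$, so that smaller $\alpha$ gives more skewed clients. The sketch below uses only the standard library; `dirichlet_partition` and its parameters are hypothetical names.

```python
import random

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.

    Assumption (not from the source): per-class client proportions are
    drawn from Dirichlet(alpha, ..., alpha), sampled here as normalized
    Gamma(alpha, 1) variates.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        if total == 0.0:  # guard against numerical underflow for tiny alpha
            weights, total = [1.0] * num_clients, float(num_clients)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += weights[c]
            # The last client takes the remainder so every index is assigned.
            end = len(idxs) if c == num_clients - 1 else round(len(idxs) * cum / total)
            clients[c].extend(idxs[start:end])
            start = end
    return clients

# Toy usage: 1000 samples, 10 balanced classes, 5 clients.
labels = [i % 10 for i in range(1000)]
for a in (1.0, 0.1):
    parts = dirichlet_partition(labels, num_clients=5, alpha=a)
    print(f"alpha={a}: client sizes = {[len(p) for p in parts]}")
```

Every index is assigned to exactly one client; the per-client class mix is what changes with `alpha`.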

Theorems & Definitions (2)

  • Theorem 1: (Informal) Convergence rate of FedPM under single local update
  • proof