DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

Nikolay Yudin, Ekaterina Grishina, Andrey Veprikov, Alexandr Beznosikov, Maxim Rakhuba

TL;DR

DyKAF maintains a Kronecker factorization $F_t\approx L_t\otimes R_t$ of the empirical Fisher matrix, updated with a projector-splitting integrator from dynamical low-rank approximation, to enable efficient gradient preconditioning within SOAP-style optimizers. By updating the Kronecker factors with the Kron_proj_split scheme and initializing them from the dominant gradient structure, DyKAF approximates the Fisher matrix more accurately than prior Kronecker approaches while remaining computationally lightweight. Across GLUE fine-tuning, large-scale LLM adaptation, and diverse pretraining tasks, DyKAF improves over AdamW, Muon, and SOAP, including on reasoning tasks, and stays robust across scales without extensive hyperparameter sweeps.

Abstract

Recently, optimizers that explicitly treat weights as matrices, rather than flattened vectors, have demonstrated their effectiveness. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge, since obtaining the optimal factorization is resource-intensive and practical methods therefore rely on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the Fisher matrix approximation quality. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.
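
The memory argument behind the Kronecker-factorized form can be illustrated with a small NumPy sketch. It is not the DyKAF update itself: the factors `L` and `R` below are arbitrary symmetric positive semidefinite matrices chosen for illustration, and `kron_apply` is a hypothetical helper name. With a row-major vectorization, $(L\otimes R)\,\mathrm{vec}(G) = \mathrm{vec}(L G R^\top)$, so the $mn \times mn$ matrix never needs to be formed.

```python
import numpy as np

def kron_apply(L, R, G):
    """Return X such that vec(X) = kron(L, R) @ vec(G), with row-major vec.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return L @ G @ R.T

rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.standard_normal((m, n))      # stand-in for a gradient matrix

# Arbitrary symmetric PSD factors (assumption: any L, R work for the identity).
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))
L = A @ A.T
R = B @ B.T

# Dense route: build the (m*n) x (m*n) matrix explicitly.
dense = np.kron(L, R) @ G.reshape(-1)

# Structured route: two small matrix products.
structured = kron_apply(L, R, G).reshape(-1)

max_abs_diff = np.max(np.abs(dense - structured))

# Number of stored entries in each representation.
dense_entries = (m * n) ** 2
factored_entries = m * m + n * n
```

For a weight matrix of size $m \times n$, the factored form stores $m^2 + n^2$ entries instead of $(mn)^2$, which is what makes this kind of preconditioner practical for large layers.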

Paper Structure

This paper contains 43 sections, 5 theorems, 76 equations, 3 figures, 10 tables, 4 algorithms.

Key Result

Proposition 1

The NKP approximation problem (eq:nkp) is equivalent to finding the best rank-1 approximation to $\mathcal{R}(A)$.
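
A minimal sketch of this equivalence follows, assuming $\mathcal{R}$ is the standard rearrangement operator used for nearest-Kronecker-product problems (the exact definition from the paper is not reproduced here). The function names `rearrange` and `nearest_kronecker` and the block sizes are illustrative choices. The rearrangement maps $B\otimes C$ to the rank-1 matrix $\mathrm{vec}(B)\,\mathrm{vec}(C)^\top$, so a truncated SVD of $\mathcal{R}(A)$ gives the minimizer of $\|A - B\otimes C\|_F$.

```python
import numpy as np

def rearrange(A, m1, n1, m2, n2):
    """Map A of shape (m1*m2, n1*n2) to R(A) of shape (m1*n1, m2*n2).

    Row (i, j) of R(A) holds the row-major flattening of block (i, j) of A.
    """
    return A.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

def nearest_kronecker(A, m1, n1, m2, n2):
    """Best Frobenius-norm approximation A ~ kron(B, C) via a rank-1 SVD."""
    RA = rearrange(A, m1, n1, m2, n2)
    U, s, Vt = np.linalg.svd(RA, full_matrices=False)
    # Split the leading singular value evenly between the two factors.
    B = (np.sqrt(s[0]) * U[:, 0]).reshape(m1, n1)
    C = (np.sqrt(s[0]) * Vt[0]).reshape(m2, n2)
    return B, C, s

rng = np.random.default_rng(0)
m1, n1, m2, n2 = 3, 3, 2, 2

# Generic matrix: the Kronecker error equals the rank-1 truncation error.
A = rng.standard_normal((m1 * m2, n1 * n2))
B, C, s = nearest_kronecker(A, m1, n1, m2, n2)
nkp_error = np.linalg.norm(A - np.kron(B, C))
rank1_error = np.sqrt(np.sum(s[1:] ** 2))

# Exact Kronecker product: it is recovered up to floating-point error.
A_exact = np.kron(rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
B2, C2, _ = nearest_kronecker(A_exact, m1, n1, m2, n2)
exact_error = np.linalg.norm(A_exact - np.kron(B2, C2))
```

Because the rearrangement only permutes entries, it preserves the Frobenius norm, which is why `nkp_error` and `rank1_error` agree.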

Figures (3)

  • Figure 1: Comparison of Hessian approximation in the controlled environment. The Frobenius norm of the difference between the Hessian and its approximations by SOAP and DyKAF (Algorithm alg:dykaf) is shown for varying training sample sizes. Our algorithm achieves consistently better approximation accuracy.
  • Figure 2: Fisher matrix approximation error with different methods in the case when $G_i$ are sampled from $\mathcal{N}(0, I)$.
  • Figure 3: Ablation on rank1_second_moment during pretraining of LLaMA-124M on FineWeb. Unlike in fine-tuning, the rank-1 approximation consistently degrades performance, supporting the use of a full-rank second moment in pretraining.
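
The kind of measurement described in Figure 2 can be sketched as follows. This is not a reproduction of the figure: the sizes `m`, `n`, `N` are arbitrary, and the trace-normalized product of averaged $G G^\top$ and $G^\top G$ is a simple heuristic chosen here as a baseline assumption, not necessarily the factorization used by SOAP or DyKAF. The sketch compares that heuristic with the optimal Kronecker approximation of the empirical Fisher matrix obtained from a rank-1 SVD of the rearranged matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N = 4, 3, 200                      # illustrative sizes (assumption)
Gs = rng.standard_normal((N, m, n))      # G_i sampled from N(0, I)

# Empirical Fisher: average of vec(G_i) vec(G_i)^T with row-major vec.
g = Gs.reshape(N, m * n)
F = g.T @ g / N

# Heuristic baseline (assumption): averaged one-sided second moments,
# normalized so that the trace of the product matches the trace of F.
L = np.einsum("tik,tjk->ij", Gs, Gs) / N   # mean of G G^T, shape (m, m)
R = np.einsum("tki,tkj->ij", Gs, Gs) / N   # mean of G^T G, shape (n, n)
F_heur = np.kron(L, R) / np.trace(L)

# Optimal Kronecker approximation via rank-1 SVD of the rearranged matrix.
RA = F.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
U, s, Vt = np.linalg.svd(RA, full_matrices=False)
F_opt = np.kron((s[0] * U[:, 0]).reshape(m, m), Vt[0].reshape(n, n))

# Frobenius-norm approximation errors.
err_heur = np.linalg.norm(F - F_heur)
err_opt = np.linalg.norm(F - F_opt)
```

By construction `err_opt` cannot exceed `err_heur`, since the SVD-based factors minimize the Frobenius error over all Kronecker products of these shapes. The gap between the two is the room that a better factor update can exploit.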

Theorems & Definitions (18)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Proposition 1 (golub2013matrix)
  • Proposition 2
  • proof
  • Definition 6
  • Proposition 3
  • ...and 8 more