DyKAF: Dynamical Kronecker Approximation of the Fisher Information Matrix for Gradient Preconditioning

Nikolay Yudin; Ekaterina Grishina; Andrey Veprikov; Alexandr Beznosikov; Maxim Rakhuba

TL;DR

DyKAF tracks the empirical Fisher matrix with a projector-splitting integrator from dynamical low-rank approximation, maintaining a Kronecker factorization $F_t\approx L_t\otimes R_t$ that enables efficient gradient preconditioning within SOAP-style optimizers. By updating the Kronecker factors with the Kron_proj_split scheme and initializing them from the dominant gradient structure, DyKAF approximates the Fisher matrix more accurately than prior Kronecker approaches while remaining computationally lightweight. Empirical results on GLUE fine-tuning, large-scale LLM adaptation, and diverse pretraining tasks show DyKAF improving over AdamW, Muon, and SOAP, including better performance on reasoning tasks and robustness to scale without extensive hyperparameter sweeps.
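To illustrate why a factorization of the form $L\otimes R$ keeps preconditioning cheap, the sketch below applies $(L\otimes R)^{-1}$ to a gradient matrix without ever forming the Kronecker product, and checks the result against the dense computation. It is only an illustration of the Kronecker identity, not the DyKAF update itself: the helper names `kron_precondition` and `random_spd`, the gradient symbol `G`, the toy sizes, and the use of a plain inverse (rather than whatever matrix power or eigenbasis rotation the optimizer actually uses) are all assumptions made here.

```python
import numpy as np


def kron_precondition(L, R, G):
    """Apply (L kron R)^{-1} to the gradient G without forming the Kronecker product.

    Assumes L (m x m) and R (n x n) are symmetric positive definite and G is m x n.
    Hypothetical helper for illustration; not taken from the paper.
    """
    left = np.linalg.solve(L, G)          # L^{-1} G
    return np.linalg.solve(R, left.T).T   # (L^{-1} G) R^{-1}, using symmetry of R


def random_spd(k, rng):
    """Random symmetric positive definite matrix (illustrative data only)."""
    a = rng.standard_normal((k, k))
    return a @ a.T + k * np.eye(k)


rng = np.random.default_rng(0)
m, n = 4, 3
L, R = random_spd(m, rng), random_spd(n, rng)
G = rng.standard_normal((m, n))

# Dense reference: materialize the (m*n) x (m*n) matrix and solve against the
# row-major flattening of G (NumPy's default memory layout).
F = np.kron(L, R)
dense = np.linalg.solve(F, G.reshape(-1)).reshape(m, n)

# Factored route: only the m x m and n x n factors are ever touched.
factored = kron_precondition(L, R, G)
max_abs_diff = np.max(np.abs(dense - factored))
```

The dense route stores $(mn)^2$ entries, while the factored route stores only $m^2+n^2$, which is the memory saving the Kronecker form provides.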

Abstract

Recently, optimizers that explicitly treat weights as matrices, rather than flattened vectors, have demonstrated their effectiveness. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge, since obtaining the optimal factorization is resource-intensive and practical methods therefore rely on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the Fisher matrix approximation quality. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.
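For context on why the optimal factorization is costly, one standard route to the Frobenius-optimal Kronecker factors is the Van Loan and Pitsianis rearrangement: reshuffle the blocks of $F$ into an $m^2\times n^2$ matrix and take its best rank-1 approximation. The sketch below shows that route under the assumption that this is the notion of optimality meant; the function name `nearest_kronecker` and the toy data are hypothetical, and this is not the procedure DyKAF uses.

```python
import numpy as np


def nearest_kronecker(F, m, n):
    """Return L_hat (m x m), R_hat (n x n) minimizing ||F - kron(L_hat, R_hat)||_F.

    Hypothetical helper for illustration. F must have shape (m*n, m*n).
    """
    # Rearrangement: row (i, j) holds the flattened n x n block
    # F[i*n:(i+1)*n, j*n:(j+1)*n], so kron(L, R) becomes outer(vec(L), vec(R)).
    blocks = F.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)
    # The best rank-1 approximation of the rearranged matrix gives the factors.
    U, s, Vt = np.linalg.svd(blocks, full_matrices=False)
    scale = np.sqrt(s[0])
    L_hat = scale * U[:, 0].reshape(m, m)
    R_hat = scale * Vt[0].reshape(n, n)
    return L_hat, R_hat


rng = np.random.default_rng(0)
m, n = 3, 2
A = rng.standard_normal((m, m))
B = rng.standard_normal((n, n))

# Exact Kronecker input: the product of the recovered factors matches it
# (the factors themselves are only determined up to a scalar).
F_exact = np.kron(A, B)
L_hat, R_hat = nearest_kronecker(F_exact, m, n)
exact_error = np.linalg.norm(np.kron(L_hat, R_hat) - F_exact)

# Perturbed input: the optimal fit can be no worse than the unperturbed factors.
E = 0.1 * rng.standard_normal(F_exact.shape)
L_noisy, R_noisy = nearest_kronecker(F_exact + E, m, n)
noisy_error = np.linalg.norm(np.kron(L_noisy, R_noisy) - (F_exact + E))
```

The rearranged matrix has as many entries as $F$ itself, $(mn)^2$, so forming it and computing its leading singular pair is impractical at the size of real weight matrices, which is consistent with the abstract's remark that practical methods fall back on heuristics.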
