KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products
Zixuan Xia, Aram Davtyan, Paolo Favaro
TL;DR
KOALA++ advances Kalman-based optimization by propagating a directional gradient-covariance surrogate $v_k = H_k P_{k-1}$ to capture structured uncertainty without storing the full covariance $P_k$. It introduces low-rank reparameterizations, a recursive $v_k$ update, and two least-squares covariance-estimation variants (vanilla and symmetric), yielding an update that remains near first-order in cost while incorporating directional curvature information. Empirically, KOALA++ matches or surpasses strong first- and second-order baselines across image classification and language modeling benchmarks, with favorable stability and efficiency. The approach offers a practical bridge between expressiveness and scalability for large-scale neural optimization, with potential for integration into pretraining and transformer-based workloads. Future work includes enforcing positive semi-definiteness of the covariance surrogate and extending the method to larger-scale, real-world models.
Abstract
We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second-order gradient calculations, our method directly estimates the parameter covariance matrix by recursively updating compact gradient-covariance products. This design improves upon the original KOALA framework, which assumed a diagonal covariance, by implicitly capturing richer uncertainty structure without storing the full covariance matrix or performing large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par with or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.
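To see why carrying only the product $v_k = H_k P_{k-1}$ can be enough, here is a minimal sketch of a textbook Kalman update with a scalar measurement, in which the loss plays the role of the observation. It is not the KOALA++ algorithm: the low-rank reparameterization, the recursive $v_k$ update, and the least-squares covariance estimates are not reproduced. The function names, the toy quadratic loss, the measurement-noise variance `R`, and the zero target are all assumptions made for illustration. The sketch only checks that, for a symmetric $P$, the parameter step computed from the full $d \times d$ covariance coincides with the step computed from the length-$d$ vector $v$.

```python
# Illustrative sketch: standard scalar-measurement Kalman update.
# All names and values here are hypothetical, not taken from the paper.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kalman_step_full(w, P, grad, loss, R, target=0.0):
    """Parameter step using the full covariance P (d x d storage)."""
    PHt = [dot(row, grad) for row in P]      # P H^T
    S = dot(grad, PHt) + R                   # innovation variance H P H^T + R
    return [wi - pi * (loss - target) / S for wi, pi in zip(w, PHt)]

def kalman_step_product(w, v, grad, loss, R, target=0.0):
    """Same step using only v = H P (length-d storage); assumes P symmetric."""
    S = dot(v, grad) + R                     # v H^T + R
    return [wi - vi * (loss - target) / S for wi, vi in zip(w, v)]

def toy_loss(w):
    return 0.5 * dot(w, w)                   # assumed toy objective; its gradient is w

w = [1.0, -2.0]
grad = list(w)                               # H: gradient of the toy loss at w
P = [[2.0, 0.5], [0.5, 1.0]]                 # symmetric covariance with off-diagonal terms
R = 0.1                                      # assumed measurement-noise variance

# v = H P, computed once here only to compare the two code paths
v = [sum(grad[i] * P[i][j] for i in range(len(w))) for j in range(len(w))]

w_full = kalman_step_full(w, P, grad, toy_loss(w), R)
w_prod = kalman_step_product(w, v, grad, toy_loss(w), R)
```

In this toy setting the two steps agree, so the parameter update itself never needs $P$ explicitly once $v$ is available. What this sketch leaves out is how $v$ is obtained and propagated without ever forming $P$, which is the part KOALA++ addresses.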
