LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models

Hossein Abdi, Mingfei Sun, Andi Zhang, Samuel Kaski, Wei Pan

TL;DR

LoKO reframes online fine-tuning of large models as a state-estimation problem and combines Low-Rank Adaptation (LoRA) with a diagonal covariance Kalman filter to achieve scalable online optimization. By reducing trainable parameters to $r(d+q)$ and maintaining a diagonal covariance, LoKO achieves linear-time complexity in the number of trainable parameters and uses an EMA-based scheme to estimate the observation noise covariance $R_k$. Empirical results across computer vision and language benchmarks show LoKO converges faster and attains higher online accuracy than standard LoRA-based optimizers, with robustness to initialization and covariance estimation choices. This work demonstrates the feasibility of Kalman-filter-based optimization for online fine-tuning of transformer- and CNN-based large models, offering a performant alternative to gradient-based methods in streaming data settings.
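To make the parameter count $r(d+q)$ concrete, here is a small worked example. The dimensions are illustrative assumptions, not values taken from the paper: a single weight matrix of size $d \times q$ with $d = q = 768$ and LoRA rank $r = 8$.

$$
\begin{aligned}
\text{full update: } & d \cdot q = 768 \cdot 768 = 589{,}824 \text{ parameters} \\
\text{LoRA update: } & n = r(d+q) = 8 \cdot (768 + 768) = 12{,}288 \text{ parameters} \\
\text{full covariance: } & n^2 = 12{,}288^2 = 150{,}994{,}944 \text{ entries} \\
\text{diagonal covariance: } & n = 12{,}288 \text{ entries}
\end{aligned}
$$

Under these assumed sizes, the low-rank step shrinks the state by a factor of 48, and the diagonal approximation then keeps the covariance storage proportional to $n$ rather than $n^2$.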

Abstract

Training large models with millions or even billions of parameters from scratch incurs substantial computational costs. Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), address this challenge by adapting only a reduced number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state estimation problem and present the Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce matrix sizes in Kalman iterations, and we further use a diagonal approximation of the covariance matrix to decrease computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we found that the initialization of the covariance matrix within the Kalman algorithm and the accurate estimation of the observation noise covariance are key to this formulation, and we propose robust approaches that work well across a wide range of well-established computer vision and language models. Our results show that LoKO converges in fewer iterations and yields better-performing models than commonly used optimizers with LoRA in both image classification and language tasks. Our study opens up the possibility of leveraging the Kalman filter as an effective optimizer for the online fine-tuning of large models.
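The diagonal approximation mentioned in the abstract can be illustrated with a generic extended-Kalman-style parameter update. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `diag_kalman_step`, the toy linear model, and the fixed observation noise covariance `R` are all hypothetical choices made for illustration (the paper estimates $R_k$ with an EMA-based scheme, which is not reproduced here).

```python
import numpy as np

def diag_kalman_step(theta, p, H, y, y_pred, R):
    """One Kalman-style update keeping only the diagonal of the covariance.

    theta  : (n,)   current parameter estimate
    p      : (n,)   diagonal of the covariance matrix P
    H      : (m, n) Jacobian of the model output w.r.t. theta
    y      : (m,)   observed target
    y_pred : (m,)   model prediction at theta
    R      : (m, m) observation noise covariance (assumed fixed here)
    """
    PHt = p[:, None] * H.T            # P H^T with P = diag(p), shape (n, m)
    S = H @ PHt + R                   # innovation covariance, shape (m, m)
    K = np.linalg.solve(S, PHt.T).T   # gain P H^T S^{-1} (S is symmetric)
    theta_new = theta + K @ (y - y_pred)
    # Diagonal of (I - K H) P; off-diagonal terms are discarded.
    p_new = p * (1.0 - np.sum(K * H.T, axis=1))
    return theta_new, p_new

# Toy streaming problem (hypothetical): noiseless linear observations.
rng = np.random.default_rng(0)
n, m = 6, 2
theta_true = rng.normal(size=n)
theta, p = np.zeros(n), np.ones(n)
R = 0.1 * np.eye(m)

for _ in range(200):
    H = rng.normal(size=(m, n))       # Jacobian of the linear model
    y = H @ theta_true                # streamed observation
    theta, p = diag_kalman_step(theta, p, H, y, H @ theta, R)

print("parameter error:", np.linalg.norm(theta - theta_true))
```

Every operation touches arrays of shape `(n,)`, `(n, m)`, or `(m, m)`, so the cost per step grows linearly with the number of parameters `n` for a fixed output size `m`; no `n × n` matrix is ever formed.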

Paper Structure

This paper contains 43 sections, 1 theorem, 7 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Leveraging the low-rank decomposition technique in LoRA and applying the diagonal approximation of the covariance matrix, the steps of the Low-Rank Kalman Optimizer (LoKO) can be outlined as follows:

Figures (9)

  • Figure 1: Performance of LoKO (blue) compared to LoRA/AdamW (red) and LoRA/AdaGrad (green) for different computer vision datasets and models. The upper rows show the training loss, and the lower rows display the average online accuracy versus the number of data points observed.
  • Figure 2: Comparison of LoKO (blue) with LoRA/AdamW (red) and LoRA/AdaGrad (green) across various language models and datasets. For each combination, the top row presents the training loss, while the bottom row illustrates the average online accuracy against the number of data points observed.
  • Figure 3: Performance of DoRA/Kalman (blue) compared to DoRA/AdamW (red) and DoRA/AdaGrad (green) for different computer vision datasets and models. The upper rows show the training loss, and the lower rows display the average online accuracy versus the number of data points observed.
  • Figure 4: Comparison of DoRA/Kalman (blue) with DoRA/AdamW (red) and DoRA/AdaGrad (green) across various language models and datasets. For each combination, the top row presents the training loss, while the bottom row illustrates the average online accuracy against the number of data points observed.
  • Figure 5: Evolution of covariance matrix ${\bm{P}}_{k} \in \mathbb{R}^{n \times n}$ in LeNet-5 using Kalman optimizer. The matrix starts with a fully dense positive-definite matrix, and with the progress of the training algorithm, it gradually converges to a (block-)diagonal configuration.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Proposition 1
  • proof
  • proof
  • proof
  • proof