Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering
Hossein Abdi, Mingfei Sun, Wei Pan
TL;DR
This paper tackles the challenge of fine-tuning CLIP models with limited labeled data by introducing a Kalman-filter-based adapter that approximates the natural gradient direction within a Bayesian framework. The approach combines second-order optimization benefits with uncertainty quantification to boost in-distribution performance while improving out-of-distribution robustness. The authors provide a theoretical link between the Kalman update and NGD, and propose two robustness strategies for OOD via adaptive observation noise and Mahalanobis-distance-based regulation. Extensive experiments across six ID datasets and multiple OOD variants demonstrate consistent ID gains and improved OOD generalization, with ablations validating the influence of key hyperparameters. This method offers a robust, uncertainty-aware alternative to standard first-order fine-tuning for vision-language models like CLIP.
Abstract
Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. For such models, few-shot fine-tuning that performs well on both in-distribution (ID) and out-of-distribution (OOD) datasets remains a major challenge, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods use local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, the natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), which is computationally expensive to compute and invert for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, which enhances generalization while providing uncertainty quantification. Extensive experiments conducted on diverse image classification datasets demonstrate that our algorithm consistently achieves superior or comparable ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, which enables more robust and efficient learning in vision-language tasks.
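
The link between a Kalman update and a curvature-preconditioned gradient step can be illustrated in the textbook linear-Gaussian case. The sketch below is not the paper's adapter or its CLIP setup; it is a minimal NumPy example under assumed notation (theta for parameters, P for their covariance, H for a linear observation matrix, R for the observation noise covariance, all hypothetical names). For the model y = H theta + noise, the Fisher information of the likelihood is H^T R^{-1} H, and the Kalman measurement update coincides with a gradient step on the negative log-likelihood preconditioned by the inverse of (prior precision + Fisher).

```python
import numpy as np

def kalman_update(theta, P, H, R, y):
    """One Kalman measurement update for y = H @ theta + noise, noise ~ N(0, R)."""
    S = H @ P @ H.T + R                       # innovation covariance
    K = np.linalg.solve(S, H @ P).T           # gain P H^T S^{-1} (S and P are symmetric)
    theta_new = theta + K @ (y - H @ theta)   # correct the mean with the innovation
    P_new = P - K @ H @ P                     # shrink the covariance
    return theta_new, P_new

def preconditioned_gradient_step(theta, P, H, R, y):
    """Gradient step on the Gaussian negative log-likelihood,
    preconditioned by the inverse of (prior precision + Fisher information)."""
    R_inv = np.linalg.inv(R)
    fisher = H.T @ R_inv @ H                  # Fisher information of the likelihood
    grad = -H.T @ R_inv @ (y - H @ theta)     # gradient of 0.5 * r^T R^{-1} r, r = y - H theta
    precision = np.linalg.inv(P) + fisher     # posterior precision
    return theta - np.linalg.solve(precision, grad)

# Small random problem (illustrative values only).
rng = np.random.default_rng(0)
d, m = 4, 3
theta = rng.normal(size=d)
A = rng.normal(size=(d, d))
P = A @ A.T + np.eye(d)                       # symmetric positive-definite prior covariance
H = rng.normal(size=(m, d))
R = 0.5 * np.eye(m)
y = rng.normal(size=m)

theta_kf, P_kf = kalman_update(theta, P, H, R, y)
theta_ng = preconditioned_gradient_step(theta, P, H, R, y)

# The two updates should agree up to floating-point error.
print(np.allclose(theta_kf, theta_ng))
```

In this linear case the agreement is exact and the updated covariance P_kf plays the role of the inverse curvature, which is also what supplies the uncertainty estimate. For a nonlinear model such as a CLIP adapter, H would have to be replaced by a local linearization, and the correspondence holds only approximately.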
