Bayesian Natural Gradient Fine-Tuning of CLIP Models via Kalman Filtering

Hossein Abdi, Mingfei Sun, Wei Pan

TL;DR

This paper tackles the challenge of fine-tuning CLIP models with limited labeled data by introducing a Kalman-filter-based adapter that approximates the natural gradient direction within a Bayesian framework. The approach combines second-order optimization benefits with uncertainty quantification to boost in-distribution performance while improving out-of-distribution robustness. The authors provide a theoretical link between the Kalman update and NGD, and propose two robustness strategies for OOD via adaptive observation noise and Mahalanobis-distance-based regulation. Extensive experiments across six ID datasets and multiple OOD variants demonstrate consistent ID gains and improved OOD generalization, with ablations validating the influence of key hyperparameters. This method offers a robust, uncertainty-aware alternative to standard first-order fine-tuning for vision-language models like CLIP.

Abstract

Vision-language pre-trained models, such as CLIP, have established new benchmarks in multimodal data mining. For such models, few-shot fine-tuning that achieves strong performance on both in-distribution (ID) and out-of-distribution (OOD) datasets remains a major challenge, especially when labeled data is scarce. Most existing fine-tuning approaches rely on first-order gradient-based optimizers, which typically suffer from slow convergence, sensitivity to step-size hyperparameters, and poor generalization in OOD settings. In contrast, second-order methods use local curvature information of the loss landscape to adjust the update step size. This is particularly beneficial for CLIP models, whose non-convex loss functions often contain sharp critical points. In such cases, the natural gradient direction can offer more substantial and efficient per-iteration updates when fine-tuning with limited data. Natural Gradient Descent (NGD) is obtained by preconditioning the standard gradient with the inverse Fisher Information Matrix (FIM), which is computationally expensive to compute for large models. To address this, we propose a Bayesian approximation of NGD using a Kalman filter for CLIP models. Our method combines the benefits of second-order optimization with Bayesian inference, which enhances generalization while providing uncertainty quantification. Extensive experiments conducted on diverse image classification datasets demonstrate that our algorithm consistently achieves superior or comparable ID performance and improved OOD robustness compared to state-of-the-art baselines. To the best of our knowledge, this work represents the first successful application of Kalman filtering to fine-tuning CLIP-based models, which enables more robust and efficient learning in vision-language tasks.
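
The connection between a Kalman update and a curvature-preconditioned gradient step can be illustrated on a toy linear-Gaussian problem. The sketch below is not the paper's adapter and does not reproduce its Lemma 4.1. It uses only the textbook Kalman measurement update and assumes static parameters, a linear observation model, and a known noise covariance. The function names, dimensions, and noise levels are hypothetical choices made for illustration. Under those assumptions, it checks numerically that the gain-form update equals a step along the log-likelihood gradient preconditioned by the posterior covariance, whose inverse accumulates the Fisher information of the observations.

```python
import numpy as np


def kalman_update(theta, P, H, y, R):
    """Textbook Kalman measurement update for static parameters theta."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    theta_new = theta + K @ (y - H @ theta)  # correct the mean with the innovation
    P_new = (np.eye(len(theta)) - K @ H) @ P # shrink the covariance
    return theta_new, P_new


def information_form_update(theta, P, H, y, R):
    """The same update written as a preconditioned gradient step."""
    R_inv = np.linalg.inv(R)
    # Posterior precision = prior precision + Fisher information H^T R^-1 H
    # of the Gaussian observation model y ~ N(H theta, R).
    P_new = np.linalg.inv(np.linalg.inv(P) + H.T @ R_inv @ H)
    # Gradient of the Gaussian log-likelihood with respect to theta.
    grad_log_lik = H.T @ R_inv @ (y - H @ theta)
    return theta + P_new @ grad_log_lik, P_new


def run_demo(steps=20, seed=0):
    """Run both forms side by side on synthetic data (all values hypothetical)."""
    rng = np.random.default_rng(seed)
    true_theta = np.array([1.5, -0.5, 2.0])
    theta, P = np.zeros(3), 10.0 * np.eye(3)  # broad Gaussian prior
    R = 0.01 * np.eye(2)                      # observation noise covariance
    max_gap = 0.0
    for _ in range(steps):
        H = rng.normal(size=(2, 3))
        y = H @ true_theta + 0.1 * rng.normal(size=2)
        theta_k, P_k = kalman_update(theta, P, H, y, R)
        theta_i, P_i = information_form_update(theta, P, H, y, R)
        max_gap = max(
            max_gap,
            float(np.abs(theta_k - theta_i).max()),
            float(np.abs(P_k - P_i).max()),
        )
        theta, P = theta_k, P_k
    return max_gap, theta
```

Calling `run_demo()` returns the largest discrepancy between the two forms over all steps, which should sit at floating-point level, together with the final parameter estimate. For a nonlinear model such as a CLIP adapter, `H` would have to be replaced by a local linearization, which this sketch does not attempt.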

Paper Structure

This paper contains 27 sections, 2 theorems, 17 equations, 3 figures, 4 tables, 1 algorithm.

Key Result

Lemma 4.1

The update step of the standard Kalman algorithm, as presented in Eq. (Algorithm Updating), can be reformulated as:

Figures (3)

  • Figure 1: We employ a Kalman-based adapter to fine-tune CLIP models. The Kalman-based optimization algorithm closely approximates the natural gradient direction within a Bayesian framework. While the natural gradient facilitates improved ID performance, the Bayesian formulation inherently enables uncertainty quantification, which leads to improved OOD generalization.
  • Figure 2: Accuracy results of different few-shot fine-tuning scenarios for six image classification datasets. Across the few-shot setups, our method (blue) achieves superior ID performance or, in certain cases, performs comparably to the baselines: Tip-Adapter-F (green), CLIP-Adapter (red), CoOp (purple), and Zero-Shot CLIP (orange).
  • Figure 3: Absolute improvement in accuracy across 11 image classification datasets achieved by our method compared to Zero-Shot CLIP. Results are reported for the 16-shot fine-tuning scenario.

Theorems & Definitions (4)

  • Lemma 4.1
  • proof
  • Proposition 4.2
  • proof