How (Mis)calibrated is Your Federated CLIP and What To Do About It?

Mainak Singha, Masih Aminbeidokhti, Paolo Casari, Elisa Ricci, Subhankar Roy

TL;DR

This work investigates how federated fine-tuning affects CLIP calibration, revealing that prompt-tuning approaches worsen calibration in FL and that traditional in-training calibration and aggregation strategies offer limited relief. It introduces FL^2oRA, a LoRA-based parameter-efficient fine-tuning method that preserves and enhances calibration while maintaining accuracy across diverse benchmarks. Through extensive experiments on in-distribution, domain generalization, and base-to-new settings, the authors demonstrate that FL^2oRA yields well-calibrated models with reduced need for explicit calibration procedures. The study also provides thorough ablations and insights into why LoRA-based updates stabilize calibration under client heterogeneity, highlighting practical implications for deploying reliable federated vision-language models.

Abstract

While vision-language models like CLIP have been extensively studied, their calibration, crucial for reliable predictions, has received limited attention. Although a few prior works have examined CLIP calibration in offline settings, the impact of fine-tuning CLIP in a federated learning (FL) setup remains unexplored. In this work, we investigate how FL affects CLIP calibration and propose strategies to improve reliability in this distributed setting. We first analyze Textual Prompt Tuning approaches and show that they degrade calibration metrics when operating under FL. We also evaluate existing in-training calibration techniques across four global aggregation methods, finding that they provide limited improvements. Our results suggest that the key challenge lies not only in how we aggregate or calibrate, but in which components we choose to fine-tune. Motivated by this insight, we propose $\text{FL}^2\text{oRA}$, a straightforward LoRA-based approach that naturally improves calibration in FL, and we analyze the factors behind its effectiveness. Experiments on multiple benchmarks demonstrate that $\text{FL}^2\text{oRA}$ consistently produces well-calibrated models, reducing the need for explicit calibration procedures. Codes are available at https://github.com/mainaksingha01/FL2oRA.
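To make the calibration metric concrete, the following is a minimal sketch of Expected Calibration Error (ECE) in its usual equal-width binning form. It is not taken from the authors' repository: the function name `expected_calibration_error`, the default of 15 bins, and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE (illustrative sketch, not the paper's code).

    confidences: max softmax probability per prediction, in (0, 1].
    correct:     1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 falls in no bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = correct[in_bin].mean()
            avg_confidence = confidences[in_bin].mean()
            # Weight each bin's |accuracy - confidence| gap by its sample share.
            ece += in_bin.mean() * abs(accuracy - avg_confidence)
    return ece

# Toy example: four predictions at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In the toy example all four predictions land in one bin with accuracy 0.75 and mean confidence 0.9, so the ECE is approximately 0.15: the model is overconfident by that margin.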

Paper Structure

This paper contains 25 sections, 13 equations, 10 figures, 16 tables.

Figures (10)

  • Figure 1: Comparison of reliability diagrams and calibration errors. Expected Calibration Error (ECE $\downarrow$) [guo2017calibration] is reported on the OxfordPets [oxfordpets] dataset in a centralized (offline) training setting (left) and a non-IID personalized federated learning (FL) setting (right).
  • Figure 2: Effect of varying the Dirichlet concentration parameter $\alpha$. We report (a) Accuracy and (b) ECE on the CIFAR-100 dataset.
  • Figure 3: Effect of varying the number of communication rounds. We report (a) Accuracy and (b) ECE on the CIFAR-100 dataset.
  • Figure 4: Ablation of the rank of the LoRA matrices: (a) Accuracy and (b) ECE on the CIFAR-100 dataset. Note that 'T' and 'V' refer to text and vision, respectively.
  • Figure 5: Ablation of PEFT strategies: (a) Accuracy and (b) ECE on the CIFAR-100 dataset.
  • ...and 5 more figures
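
The Dirichlet parameter $\alpha$ in Figure 2 controls how non-IID the client data is. A common way to build such a split is sketched below. The exact partition protocol used in the paper is not stated here, so the function `dirichlet_partition`, its arguments, and the toy label set are assumptions for illustration only.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Label-skewed client split via per-class Dirichlet proportions.

    Illustrative sketch only. Smaller alpha concentrates each class on
    fewer clients (more heterogeneity); larger alpha approaches an even split.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]

    for cls in np.unique(labels):
        # Shuffle the sample indices of this class.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Draw the share of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Convert cumulative shares into split points over the class samples.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# Toy example: 10 classes with 50 samples each, split across 5 clients.
toy_labels = np.repeat(np.arange(10), 50)
toy_parts = dirichlet_partition(toy_labels, n_clients=5, alpha=0.1)
```

Every sample is assigned to exactly one client, and with a small `alpha` such as 0.1 most clients end up holding only a few classes, which is the heterogeneous regime where calibration is reported to degrade.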