Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, Kyungwoo Song

TL;DR

This work proposes a robust fine-tuning method that simultaneously improves OOD accuracy and confidence calibration in vision-language models. The method fine-tunes with a constrained multimodal contrastive loss that enforces a larger smallest singular value.

Abstract

Improving out-of-distribution (OOD) generalization during in-distribution (ID) adaptation is a primary goal of robust fine-tuning of zero-shot models beyond naive fine-tuning. However, despite decent OOD generalization performance from recent robust fine-tuning methods, confidence calibration for reliable model output has not been fully addressed. This work proposes a robust fine-tuning method that improves both OOD accuracy and confidence calibration simultaneously in vision-language models. First, we show that both OOD classification and OOD calibration errors have a shared upper bound consisting of two terms computed on ID data: 1) the ID calibration error and 2) the smallest singular value of the ID input covariance matrix. Based on this insight, we design a novel framework that conducts fine-tuning with a constrained multimodal contrastive loss enforcing a larger smallest singular value, which is further guided by the self-distillation of a moving-averaged model to also achieve calibrated predictions. Starting from empirical evidence supporting our theoretical statements, we provide extensive experimental results on ImageNet distribution shift benchmarks that demonstrate the effectiveness of our theorem and its practical implementation.
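
The two ID quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names, the 10-bin equal-width ECE estimator, and the use of a centered, 1/n-normalized empirical covariance are all assumptions made for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-weighted mean of |accuracy - mean confidence|.

    Assumption: equal-width, right-closed bins on [0, 1].
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to a bin index in [0, n_bins - 1].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return float(ece)


def smallest_singular_value(features):
    """Smallest singular value of the empirical covariance of (n, d) features.

    Assumption: features are centered and the covariance is divided by n.
    """
    features = np.asarray(features, dtype=float)
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / features.shape[0]
    return float(np.linalg.svd(cov, compute_uv=False).min())
```

A value of `smallest_singular_value` near zero indicates that the features collapse along at least one direction, which is the situation the constrained loss is described as counteracting.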

Paper Structure (24 sections, 5 theorems, 16 equations, 8 figures, 16 tables)

Key Result

Theorem 3.1

Let $h:\mathcal{X}\rightarrow[0,1]$ be a real-valued function of the form $h(x)=\sum_{i=1}^{d}h_{i}(x[i])$, where each $h_{i}$ is an arbitrary one-dimensional function, and let $h$ belong to a hypothesis class $\mathcal{H}$ with pseudo-dimension $\mathcal{P}dim(\mathcal{H})=d_{h}$, $\hat{\mathcal{D}}_\text{

Figures (8)

  • Figure 1: OOD accuracy vs. ID accuracy (left) and negative OOD ECE (right). We report negative OOD ECE so that, in both plots, desired values lie on the right side of the x-axis. ID ACC refers to ImageNet-1K top-1 accuracy; OOD ACC and ECE refer to the accuracy and ECE averaged over the five ImageNet distribution shifts (ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch, and ObjectNet). Detailed numbers are reported in Tables \ref{tab:natural_shift_acc} and \ref{tab:natural_shift_ece}. Note that the competing methods (FLYP \cite{goyal2023finetune}, LP-FT \cite{kumar2022fine}, and Lipsum-FT \cite{nam2024lipsum}) improve OOD accuracy over the zero-shot baseline (ZS) and naive fine-tuning (FT) but suffer from OOD miscalibration, presumably because they address only generalization during fine-tuning. Our CaRot outperforms existing methods on both OOD accuracy and calibration by large margins.
  • Figure 2: Overview of CaRot. We fine-tune a VLM using a multimodal contrastive loss with an orthogonality constraint on the visual projection layer (eq. \ref{eq:loss_mclsr}) and a self-distillation loss $\mathcal{L}_{\text{SD}}$ (eq. \ref{eq:loss_mkd}) that takes the predictions of the EMA teacher $\psi$ as soft target labels to train the student model $\theta$. Darker and lighter elements denote values closer to 1 and 0, respectively. The teacher and student models share an identical VLM architecture consisting of an image encoder $f_{\theta_{v}}:=[f_{\hat{\theta}_v}; W_v]$ and a text encoder $g_{\theta_{l}}:=[g_{\hat{\theta}_l}; W_l]$, where $W$ is the last projection layer. Given (image, text) pair data, the model outputs the pairwise similarity scores for in-batch image-text representations.
  • Figure 3: Analysis of error bounds on synthetic data. Plots on the left side show the RHS (x-axis) and LHS (y-axis; MSE for ineq. \ref{eq:cls_bound} and ECE for ineq. \ref{eq:cal_bound}) of the inequalities in §\ref{sec:theoritical_analysis}. MSE denotes the mean squared error, $\mathcal{L}_{\text{OC}}$ the singular value regularization, and $\mathcal{L}_{\text{SD}}$ the calibration regularization.
  • Figure 4: IN-C corruption-wise accuracy (top) and ECE (bottom). We evaluate accuracy and ECE over 15 types of image corruption at five severity levels and report the average performance per corruption. CaRot consistently outperforms baseline methods across diverse corruptions.
  • Figure 5: Closer look at the effectiveness of CaRot on different corruptions. We provide IN-C accuracy on brightness (left) and elastic transform (right) corruptions. CaRot excels on coarser corruptions such as brightness, whereas it is less effective on finer corruptions such as elastic transform.
  • ...and 3 more figures
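
The three ingredients named in the Figure 2 caption (an orthogonality constraint on a projection matrix, an EMA teacher, and self-distillation against the teacher's soft targets) can be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the penalty form $\|W^\top W - I\|_F^2$, the momentum value, the cross-entropy form of the distillation loss, and all function names are hypothetical choices; the exact losses are those of the equations the caption references.

```python
import numpy as np


def orthogonality_penalty(W):
    """Assumed penalty ||W^T W - I||_F^2; zero iff W has orthonormal columns."""
    gram = W.T @ W
    return float(np.sum((gram - np.eye(W.shape[1])) ** 2))


def ema_update(teacher, student, momentum=0.99):
    """Return teacher <- momentum * teacher + (1 - momentum) * student.

    Both arguments are dicts mapping parameter names to arrays.
    """
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def self_distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of student predictions against teacher soft targets."""
    targets = np.exp(log_softmax(teacher_logits))
    return float(-(targets * log_softmax(student_logits)).sum(axis=-1).mean())
```

In a training loop one would typically add the penalty and the distillation term to the contrastive loss with weighting coefficients and call `ema_update` after each optimizer step; those coefficients and the schedule are not specified here.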

Theorems & Definitions (7)

  • Theorem 3.1
  • Theorem C.1: Restatement of Theorem \ref{thm:bound_new2}.
  • Proof
  • Definition C.2: $\mathcal{H}$-square disagreement
  • Proposition C.3: OOD calibration error bound
  • Proposition C.4: OOD generalization error bound
  • Lemma C.5