$\Delta\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

TL;DR

This work tackles the challenge of robustly fine-tuning vision-language models to perform well on both covariate-shifted closed-set OOD and open-set semantic OOD. It introduces $\Delta\mathrm{Energy}$, an energy-based OOD score that measures the energy change when re-aligning vision-language representations by cropping the top-$c$ cosine similarities, and proves it provides better separation between ID and OOD than prior methods. To jointly improve detection and generalization, the authors propose an $\mathrm{EBM}$ bound-maximization loss that increases the lower bound of $\Delta\mathrm{Energy}$ and yields domain-consistent Hessians, enabling a unified prompt-tuning framework. Through extensive experiments on challenging OOD benchmarks (including ImageNet-1k, cross-dataset, and hard OOD splits), the approach achieves substantial gains (10–25% in AUROC) over strong baselines, demonstrating robust OOD handling for VLMs in practical deployment.

Abstract

Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named ΔEnergy. ΔEnergy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, ΔEnergy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for ΔEnergy (termed EBM). EBM is theoretically proven not only to enhance OOD detection but also to yield a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we develop a unified fine-tuning framework that improves VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
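The core idea of the score can be illustrated with a minimal sketch. It is not the authors' implementation: the free-energy form $-T \log \sum_k \exp(s_k / T)$, the temperature value, the sign convention (energy after cropping minus energy before), and the function names `energy` and `delta_energy` are all assumptions made for illustration. The abstract only states that the maximum cosine similarity is reduced to a low value and that the resulting energy change is larger for closed-set data.

```python
import math


def energy(sims, temperature=0.01):
    """Free-energy score -T * log(sum_k exp(s_k / T)) over class cosine similarities.

    The temperature is an assumed value, not one taken from the paper.
    """
    scaled = [s / temperature for s in sims]
    m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


def delta_energy(sims, c=1, temperature=0.01, crop_value=0.0):
    """Energy change after resetting the top-c cosine similarities to crop_value.

    Assumed sign convention: energy(cropped) - energy(original).
    """
    top = sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)[:c]
    cropped = [crop_value if k in top else s for k, s in enumerate(sims)]
    return energy(cropped, temperature) - energy(sims, temperature)


# Hypothetical similarity vectors: a closed-set sample with one dominant class
# and an open-set sample with nearly flat similarities.
id_sims = [0.35, 0.20, 0.18, 0.15]
ood_sims = [0.22, 0.21, 0.20, 0.19]

print(delta_energy(id_sims) > delta_energy(ood_sims))
```

Under these assumptions, removing a dominant top-1 similarity changes the log-sum-exp much more than removing one of several near-equal similarities, which is why the closed-set sample shows the larger energy change.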

Paper Structure

This paper contains 19 sections, 7 theorems, 48 equations, 3 figures, 12 tables, 1 algorithm.

Key Result

Theorem 3.2

[OOD Detection Ability of $\boldsymbol{\Delta\mathrm{Energy}}$] Suppose that the maximum cosine similarity for an ID sample $\mathbf{x}_{\text{ID}}$ is greater than that of an open-set OOD sample $\mathbf{x}_{\text{OOD}}$, i.e., $s_{\hat{y}_1}(\mathbf{x}_{\text{ID}}) > s_{\hat{y}_1}(\mathbf{x}_{\text{OOD}})$ …

Figures (3)

  • Figure 1: (A) Illustration of $\Delta\mathrm{Energy}$ for OOD detection. Significant differences in $\Delta\mathrm{Energy}$ are observed between closed-set data and open-set OOD data when the maximum cosine similarity is cropped to zero. (B) Illustration of the $\Delta\mathrm{Energy}$ for OOD generalization. We introduce the EBM method to achieve domain-consistent Hessians, which simultaneously triggers bound optimization for $\Delta\mathrm{Energy}$. More details are in Section \ref{sec:EBM}. (C) Comparison between our $\Delta\mathrm{Energy}$ and EBM with state-of-the-art methods. In the radar plots, all values are normalized to the range [0, 1]. It is observed that recent methods aimed at improving VLMs' OOD detection may not scale well to handling different types of distribution shifts in challenging ImageNet-1k OOD datasets.
  • Figure 2: Overview of the proposed method. Based on the prompt-tuning approach, we freeze both the image encoder and the text encoder, making only the context vectors ($\uptheta=[\uptheta_1, \cdots, \uptheta_n]$) learnable under the proposed objective function, as shown in Equation \ref{eq:final_loss}. During fine-tuning, we apply a masking operation to each ID image feature based on the top-1 similarity, as defined in Equation \ref{eq:mask-image}. We then compute the resulting energy change after modifying the vision-language alignment via masking, which allows us to perform bound optimization on $\Delta \mathrm{Energy}$. In the inference phase, following Equation \ref{eq:mask-text}, we reset the top-$c$ cosine similarities and then compute $\Delta\mathrm{Energy}$ for OOD detection. Simultaneously, we use the fine-tuned text feature and unmasked image feature for classification at test time. The complete algorithm can be seen in Appendix \ref{app:exp_details}.
  • Figure 3: The significant prediction difference between closed-set data and open-set OOD data when vision-language re-alignment is applied to the zero-shot (ZS) CLIP model. This difference offers a novel approach to distinguishing between closed-set and open-set classes. Based on the element-wise product between CLIP's image and text features, the masked ZS CLIP(+) model zeroes out the elements of the image feature where the corresponding values in the product are negative. In contrast, the opposite operation is applied in ZS CLIP(-). It is observed that masking the elements where $P_j < 0$ preserves the model’s original attention, which motivates us to leverage this consistency between the original and masked domains to improve OOD generalization.
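
The masking described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption, not the paper's Equation for the mask: the function name `mask_image_feature`, the choice of which text feature is paired with the image feature, the handling of elements where the product is exactly zero, and the toy feature values are all assumptions.

```python
def mask_image_feature(image_feat, text_feat, variant="+"):
    """Zero out image-feature elements based on the sign of P_j = v_j * t_j.

    variant "+": zero the elements where P_j < 0 (as in ZS CLIP(+)).
    variant "-": zero the elements where P_j > 0 (assumed reading of the
                 "opposite operation" in ZS CLIP(-)).
    Elements with P_j == 0 are kept in both variants (an assumption).
    """
    masked = []
    for v, t in zip(image_feat, text_feat):
        p = v * t
        drop = (p < 0) if variant == "+" else (p > 0)
        masked.append(0.0 if drop else v)
    return masked


def dot(a, b):
    """Unnormalized inner product between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical 4-dimensional features; real CLIP features are much larger.
image = [0.5, -0.2, 0.1, 0.4]
text = [0.3, 0.6, -0.7, 0.2]

plus = mask_image_feature(image, text, variant="+")
minus = mask_image_feature(image, text, variant="-")

# The (+) mask keeps only non-negative contributions, so the inner product
# with the text feature cannot decrease; the (-) mask cannot increase it.
print(dot(plus, text) >= dot(image, text), dot(minus, text) <= dot(image, text))
```

In this sketch the (+) variant removes only the dimensions that disagree with the text feature, which is one way to see why it would leave the prediction largely intact while the (-) variant would not.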

Theorems & Definitions (7)

  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Proposition 3.6
  • Theorem B.1
  • Theorem C.2