CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

Lin Zhu, Yifeng Yang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye

TL;DR

CRoFT tackles two problems that arise when fine-tuning vision-language models: improving out-of-distribution (OOD) generalization under covariate shifts and detecting open-set OOD data. It introduces the energy distribution reshaping (EDR) loss to sharpen open-set detection and shows that minimizing the gradient of energy scores implicitly aligns Hessians across domains, which yields a Hessian-based OOD generalization bound. A worst-case covariate-shift feature generator and lightweight adapters allow both tasks to be optimized concurrently, as formalized in the final CRoFT objective: $\mathcal{L}_{CRoFT} = \hat{\mathcal{E}}_S(\boldsymbol{\uptheta}) + \lambda_1 \mathcal{L}_c + \lambda_2 \left( \mathcal{L}_e(z_I) + \mathcal{L}_e(z_I^c) \right)$. Empirically, CRoFT achieves state-of-the-art results on Setup-I (open-set OOD detection and closed-set OOD generalization) and Setup-II (cross-dataset OOD detection), with substantial improvements in FPR95 and AUROC, supporting both the theory and its practical value for robust, open-world VL-PTM fine-tuning.
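
To make the structure of the objective concrete, the following minimal Python sketch combines the terms exactly as they appear in the formula above. The function name `croft_objective` and all numeric values are hypothetical placeholders; how each individual term is computed is defined in the paper and is not reproduced here.

```python
def croft_objective(ce_loss, l_c, l_e_id, l_e_cov, lambda_1, lambda_2):
    """Weighted sum of the terms in the CRoFT objective.

    ce_loss  : classification loss on ID features (the E_S(theta) term)
    l_c      : loss on the generated covariate-shifted features (L_c)
    l_e_id   : EDR loss on the ID image features, L_e(z_I)
    l_e_cov  : EDR loss on the generated features, L_e(z_I^c)
    lambda_1, lambda_2 : trade-off weights (hyperparameters)
    """
    return ce_loss + lambda_1 * l_c + lambda_2 * (l_e_id + l_e_cov)


# Placeholder numbers, chosen only to illustrate the arithmetic.
total = croft_objective(ce_loss=0.5, l_c=0.2, l_e_id=0.1, l_e_cov=0.3,
                        lambda_1=1.0, lambda_2=0.5)
print(total)
```

Setting `lambda_1 = lambda_2 = 0` reduces the expression to the plain classification loss, which shows that the two extra terms act as regularizers on top of standard fine-tuning.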

Abstract

Recent vision-language pre-trained models (VL-PTMs) have shown remarkable success in open-vocabulary tasks. However, downstream use cases often involve further fine-tuning of VL-PTMs, which may distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of enhancing out-of-distribution (OOD) generalization on covariate shifts and simultaneously detecting semantic-shifted unseen classes. Thus a critical but underexplored question arises: How to improve VL-PTMs' generalization ability to closed-set OOD data, while effectively detecting open-set unseen classes during fine-tuning? In this paper, we propose a novel objective function of OOD detection that also serves to improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on training data leads to domain-consistent Hessians of classification loss, a strong indicator for OOD generalization revealed by theoretical analysis. Based on this finding, we have developed a unified fine-tuning framework that allows for concurrent optimization of both tasks. Extensive experiments have demonstrated the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.
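
As a rough illustration of what "gradient magnitude of energy scores" refers to, the sketch below uses the widely used log-sum-exp energy score and its gradient with respect to the logits. This is an assumption made for illustration: the paper's own energy definition (eq:energy_score) and the variable with respect to which it takes the gradient may differ. The function names and example logits are hypothetical.

```python
import math


def energy(logits, temperature=1.0):
    """Log-sum-exp energy: E = -T * log(sum_i exp(f_i / T))."""
    scaled = [f / temperature for f in logits]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


def energy_grad(logits, temperature=1.0):
    """Analytic gradient w.r.t. the logits: dE/df_i = -softmax(f / T)_i."""
    scaled = [f / temperature for f in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [-e / z for e in exps]


def grad_sq_norm(logits, temperature=1.0):
    """Squared L2 norm of the energy gradient (the quantity to be minimized)."""
    return sum(g * g for g in energy_grad(logits, temperature))


# Illustrative logits only (assumed values, not taken from the paper).
confident = [8.0, 0.0, 0.0]
uncertain = [1.0, 1.0, 1.0]
print(energy(confident), grad_sq_norm(confident))
print(energy(uncertain), grad_sq_norm(uncertain))
```

In a real fine-tuning loop an automatic-differentiation framework would compute this gradient; the closed form is used here only to keep the example dependency-free.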

Paper Structure

This paper contains 17 sections, 7 theorems, 33 equations, 5 figures, 10 tables, 1 algorithm.

Key Result

Proposition 3.1

[Energy distribution reshaping (EDR) loss] Given the training data $\widehat{\mathcal{D}}_\mathcal{S}$, our fine-tuning framework computes the training data's energy scores according to the energy-score equation (eq:energy_score). To approach the solution of $\min E_{\boldsymbol{\uptheta}}(\mathbf{x})$, i.e., … where $\boldsymbol{\uptheta} = \{\boldsymbol{\uptheta}_0, \boldsymbol{\uptheta}_l\}$, …

Figures (5)

  • Figure 1: Illustration of a typical data setting in real-world scenarios. Real-world applications may involve several types of data: (i) closed-set ID data (e.g., dog), (ii) closed-set OOD data with covariate shifts (e.g., dog with changed image styles), and (iii) open-set OOD data with semantic shifts (e.g., panda). The significant overlap between the energy distributions of closed-set ID data and open-set OOD data makes it hard for CLIP to detect open-set OOD data. The notable discrepancy between closed-set ID and closed-set OOD data also complicates OOD generalization to closed-set OOD data.
  • Figure 2: Overview of our CRoFT framework. Our theoretical analysis leads to the design of a new fine-tuning framework. As shown in panel (a), we inject adapters, i.e., one-layer linear projections, after CLIP's pre-trained encoders. Based on the adapted image feature $\mathbf{z_I}$ and the adapted text feature $\mathbf{z_T}$, we generate the most challenging covariate-shifted OOD image features $\mathbf{z_I^c}$, simulating worst-case scenarios. The generation process, depicted in panel (b), follows the criterion defined in the OOD feature generator equation (eq:ood_generator): the generated feature preserves semantic information, so that classification accuracy is maintained, but differs from the ID image feature $\mathbf{z_I}$. Finally, as shown in panel (c), we optimize on the generated $\mathbf{z_I^c}$ using the proposed loss $\mathcal{L}_{\text{c}}$. Meanwhile, we minimize the classification loss (cross-entropy) on the ID image features, denoted as $\widehat{\mathcal{E}}_{\mathcal{S}}({\boldsymbol{\uptheta}})$, while reshaping the energy distribution for $\mathbf{z_I}$ and $\mathbf{z_I^c}$ through the EDR loss (i.e., $\mathcal{L}_{\text{e}}(\mathbf{z_I})$ and $\mathcal{L}_{\text{e}}(\mathbf{z_I^c})$).
  • Figure 3: Ablations on the proposed EDR loss. With the proposed EDR loss $\mathcal{L}_{\text{e}}$, our method fine-tunes CLIP's features toward better discrimination between open-set and closed-set data, without sacrificing test accuracy.
  • Figure 4: Ablation study results for Setup-II. (a): Comparison with CLIP-Adapter in open-set OOD detection using KNN distances on the adapted image features. (b): t-SNE visualization of image features. (c): Average KNN distance between OOD features and ID features. (VLCS is used as the closed-set data in (b) and (c).) (d): Experimental results on VLCS for CRoFT with $\mathcal{L}_{\text{c}}$ vs. without $\mathcal{L}_{\text{c}}$.
  • Figure 5: (a): Examples of three types of data in Setup-I: (i) closed-set ID data (e.g., broccoli), (ii) closed-set OOD data (e.g., broccoli with changed image styles), and (iii) open-set OOD data (e.g., goose). (b): CLIP's energy distributions on different types of data. (c): CRoFT's energy distributions on different types of data. (d): Examples of the closed-set data and cross-dataset open-set OOD data in Setup-II, where we use the PACS dataset as the closed-set data. (e): Visualization of image features. All image features are reduced to the 1-dimensional space by t-SNE. (f): The FPR95 and AUROC results in discriminating closed-set OOD and open-set OOD data.
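
The FPR95 and AUROC metrics referenced in the captions above can be computed from per-sample detection scores. The sketch below is a dependency-free illustration that assumes higher scores mean "more in-distribution" (for example, a negated energy score) and that ID samples are the positive class; the paper's evaluation code may use a different convention. The function names and the toy scores are hypothetical.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win. Quadratic in the number of samples, which is
    acceptable for a small illustration.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold that keeps
    at least `tpr` of the ID samples (FPR95 when tpr=0.95)."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted; the small epsilon guards
    # against floating-point overshoot in tpr * len(ranked).
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    return sum(s >= threshold for s in ood_scores) / len(ood_scores)


# Toy scores for illustration only.
id_scores = [float(i) for i in range(1, 21)]
ood_scores = [0.0, 0.5, 1.5, 2.5]
print(auroc(id_scores, ood_scores))
print(fpr_at_tpr(id_scores, ood_scores))
```

Lower FPR95 and higher AUROC both indicate better separation between closed-set and open-set data.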

Theorems & Definitions (7)

  • Proposition 3.1
  • Theorem 3.3
  • Proposition 3.4
  • Lemma 3.5
  • Theorem 3.6
  • Theorem 1.2
  • Lemma 2.1