CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection
Lin Zhu, Yifeng Yang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye
TL;DR
CRoFT tackles the dual challenge of improving out-of-distribution (OOD) generalization under covariate shifts and detecting open-set OOD samples when fine-tuning vision-language pre-trained models (VL-PTMs). It introduces the energy distribution reshaping (EDR) loss to sharpen open-set detection and shows that minimizing the gradient magnitude of energy scores implicitly aligns the Hessians of the classification loss across domains, which enables a Hessian-based OOD generalization bound. A worst-case covariate-shift feature generator and lightweight adapters allow both tasks to be optimized concurrently, as formalized in the final CRoFT objective: $\mathcal{L}_{CRoFT} = \hat{\mathcal{E}}_S(\boldsymbol{\uptheta}) + \lambda_1 \mathcal{L}_c + \lambda_2 \left( \mathcal{L}_e(z_I) + \mathcal{L}_e(z_I^c) \right)$. Empirically, CRoFT achieves state-of-the-art results on Setup-I (open-set OOD detection and closed-set OOD generalization) and Setup-II (cross-dataset OOD detection), with substantial improvements in FPR95 and AUROC, supporting both the theory and its practical value for robust, open-world VL-PTM fine-tuning.
Abstract
Recent vision-language pre-trained models (VL-PTMs) have shown remarkable success in open-vocabulary tasks. However, downstream use cases often involve further fine-tuning of VL-PTMs, which may distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image styles) and semantic shifts (e.g., unseen classes at test time). This highlights the importance of enhancing out-of-distribution (OOD) generalization under covariate shifts while simultaneously detecting semantically shifted unseen classes. Thus, a critical but underexplored question arises: how can we improve VL-PTMs' generalization to closed-set OOD data while effectively detecting open-set unseen classes during fine-tuning? In this paper, we propose a novel objective function for OOD detection that also serves to improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on training data leads to domain-consistent Hessians of the classification loss, which our theoretical analysis identifies as a strong indicator of OOD generalization. Based on this finding, we develop a unified fine-tuning framework that allows concurrent optimization of both tasks. Extensive experiments demonstrate the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.
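To make the phrase "gradient magnitude of energy scores" concrete, the following is a minimal sketch, not the paper's implementation. It assumes a plain linear classification head with logits $f(z) = Wz$, the standard energy score $E(z) = -\log \sum_k \exp f_k(z)$ at temperature 1, and a squared-norm penalty averaged over a batch as an illustrative stand-in for the energy-based term. The function names (`energy`, `energy_grad`, `grad_penalty`) are hypothetical and do not come from the CRoFT repository.

```python
import math

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) via the max-shift trick.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(values):
    lse = logsumexp(values)
    return [math.exp(v - lse) for v in values]

def logits(W, z):
    # Linear head (an assumption of this sketch): f_k(z) = <W[k], z>.
    return [sum(w * x for w, x in zip(row, z)) for row in W]

def energy(W, z):
    # Energy score at temperature 1: E(z) = -log sum_k exp(f_k(z)).
    return -logsumexp(logits(W, z))

def energy_grad(W, z):
    # Analytic gradient for the linear head:
    # dE/dz = -sum_k softmax_k(W z) * W[k].
    p = softmax(logits(W, z))
    return [-sum(p[k] * W[k][j] for k in range(len(W)))
            for j in range(len(z))]

def grad_penalty(W, batch):
    # Mean squared L2 norm of the energy gradient over a batch of features.
    total = 0.0
    for z in batch:
        g = energy_grad(W, z)
        total += sum(gj * gj for gj in g)
    return total / len(batch)

if __name__ == "__main__":
    W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.1]]   # 3 classes, 2-d features
    batch = [[0.2, -0.4], [1.0, 0.5]]
    print("energy:", [energy(W, z) for z in batch])
    print("penalty:", grad_penalty(W, batch))
```

In a real fine-tuning loop the gradient would come from automatic differentiation through the adapters rather than from a closed form, and the penalty would be weighted against the classification loss, in the spirit of the $\lambda_2$ term in the objective.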
