Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang

TL;DR

This work addresses the limited OOD generalization of finetuned vision-language models. It shows that long finetuning without proper regularization overfits the known (base) classes and degrades recognition of unknown classes. The proposed method, OGEN, combines a lightweight class-conditional feature generator, which synthesizes OOD image features from the names of unknown classes, with an adaptive local mean-teacher distillation scheme that regularizes the joint optimization. Across 11 datasets and multiple prompt-based baselines, OGEN consistently improves OOD generalization while maintaining or modestly improving in-distribution accuracy. The approach is model-agnostic and data-efficient, offering a practical path to more robust open-domain vision-language systems.

Abstract

Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. There are recent finetuning methods, such as prompt learning, that not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. Then we propose a novel approach OGEN to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features will provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism to regularize our feature generation model during joint optimization, i.e., adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings. Code: https://github.com/apple/ml-ogen.
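
The self-distillation idea of transferring knowledge between model states can be illustrated with a minimal sketch. It assumes a teacher formed by averaging the student's most recent parameter snapshots over a fixed local window, and a mean-squared-error distillation loss. The class `LocalMeanTeacher`, the function `distillation_loss`, the fixed window length, and the toy parameters are all hypothetical. The adaptive part of the method is not reproduced here.

```python
from collections import deque

import numpy as np


class LocalMeanTeacher:
    """Hypothetical teacher: mean of the student's last `window` snapshots."""

    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest snapshot when full.
        self.snapshots = deque(maxlen=window)

    def update(self, student_params):
        # Store a copy so later in-place student updates do not leak in.
        self.snapshots.append(np.array(student_params, dtype=float, copy=True))

    def params(self):
        return np.mean(np.stack(self.snapshots), axis=0)


def distillation_loss(student_out, teacher_out):
    """Mean-squared error between student and teacher outputs (an assumption)."""
    return float(np.mean((student_out - teacher_out) ** 2))


# Toy run: the "student" is a 4-dim parameter vector equal to the step index.
teacher = LocalMeanTeacher(window=3)
for step in range(5):
    student = np.full(4, float(step))
    teacher.update(student)

teacher_params = teacher.params()  # average of the snapshots from steps 2, 3, 4
loss = distillation_loss(student, teacher_params)
```

In this toy run the teacher lags behind the latest student state, so the loss pulls the student back toward its recent average. That is the regularizing effect the abstract attributes to distillation between model states.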

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: (a) We study OOD generalization when finetuning the vision-language model CLIP on various downstream tasks. We consider both within-dataset generalization, where one dataset has ID vs. OOD (or known vs. unknown) class splits for finetuning and evaluation respectively, and the more challenging cross-dataset generalization setting. The problem definition is clarified further in Appendix A. (b) Examples of within-dataset generalization: learning curves of the prompt learning method CoOp (Zhou et al., 2021) when CLIP is finetuned for long enough (200 epochs) on three datasets (more in the appendix). CoOp clearly overfits the known classes of each dataset, with a notable accuracy drop on the unknown classes. Our proposed method OGEN largely reduces such overfitting through effective regularization.
  • Figure 2: (a) To improve OOD generalization, we propose to gain knowledge of unknown classes by directly synthesizing their image features. This helps to learn a more reliable decision boundary between known and unknown classes in the feature space. (b) Prompt learning based on discriminating both the known and synthesized unknown features (from our class-conditional feature generator $\theta$, see details in text). (c) Implementation of $\theta$ using a lightweight attention module.
  • Figure 3: Visualizing image feature synthesis based on the joint extrapolation scheme (Eq. (5)) on the Flowers102 dataset. Note that our feature generator is not trained on the unknown classes, but can still synthesize faithful image features (red triangles) lying close to the real ones (gray crosses). This is achieved by extrapolating an unseen instance from the kNN class examples (only one random example per kNN class is used), effectively combining their related patterns, such as the shape and texture of flowers.
  • Figure 4: More example learning curves of the long finetuning runs (200 epochs) with the CoOp method (Zhou et al., 2021). Under the within-dataset generalization setting, CoOp typically overfits the known classes and loses accuracy on the unknown classes. The class-conditional feature generator plays a key role in our full method OGEN: it reduces overfitting by generating OOD features for the unknown-aware optimization. Our adaptive self-distillation method further reduces overfitting by regularizing the optimization dynamics.
  • Figure 5: Learning curves of the long finetuning runs (100 epochs) with KgCoOp (Yao et al., 2023) vs. OGEN-KgCoOp (within-dataset generalization setting). Despite the overfitting-reduction technique used in KgCoOp, it still overfits to some extent. OGEN often improves the learning curves of both base and new classes.
  • ...and 2 more figures
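
The captions of Figures 2 and 3 describe synthesizing an unknown-class image feature from the kNN known classes with a lightweight attention module. The sketch below illustrates that idea under stated assumptions and is not the paper's generator. It uses a parameter-free, single-head attention step in which the unknown class's text feature is the query, the text features of its k nearest known classes are the keys, and one image feature per neighbor class is the value. The function `synthesize_unknown_feature` and all toy arrays are hypothetical. The paper's module is learned, whereas this version has no trainable weights and produces a convex combination of neighbor features, so strictly speaking it interpolates rather than extrapolates.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def synthesize_unknown_feature(unknown_text, known_text, known_image, k=3):
    """Attention over the k nearest known classes (illustrative assumption).

    unknown_text: (d,)   text feature of the unknown class name
    known_text:   (C, d) text features of the known class names
    known_image:  (C, d) one image feature per known class
    """
    q = l2_normalize(unknown_text)
    keys_all = l2_normalize(known_text)
    sims = keys_all @ q                       # cosine similarity to each known class
    nn_idx = np.argsort(-sims)[:k]            # indices of the k most similar classes
    keys, values = keys_all[nn_idx], known_image[nn_idx]
    scores = keys @ q / np.sqrt(q.shape[0])   # scaled dot-product attention scores
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return l2_normalize(weights @ values), nn_idx, weights


# Toy example: 4 known classes with one-hot text features in 4 dimensions.
known_text = np.eye(4)
known_image = 2.0 * np.eye(4)
unknown_text = np.array([1.0, 0.5, 0.0, 0.0])
feat, nn_idx, weights = synthesize_unknown_feature(
    unknown_text, known_text, known_image, k=2
)
```

In the toy example the unknown class is most similar to known classes 0 and 1, so the synthesized feature is a unit-norm mixture of their image features, weighted toward class 0.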