Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization

Jian Liang, Lijun Sheng, Zhengbo Wang, Ran He, Tieniu Tan

TL;DR

This work tackles realistic unsupervised fine-tuning of CLIP when unlabeled data may include out-of-distribution (OOD) samples. It introduces Universal Entropy Optimization (UEO), which uses sample-level confidence to approximately minimize the entropy of in-distribution predictions while increasing the entropy of potential OOD predictions, formalized with a weight-based entropy objective and a reverse weighting scheme; the approach also updates textual prompts and channel-wise affine parameters in the visual branch for efficiency. Extensive experiments across 15 domains and four category-shift scenarios show that UEO consistently improves both ID generalization and OOD detection compared to strong baselines, with additional gains when using normalized affine layers. The results demonstrate that a simple, parameter-efficient strategy can robustly adapt CLIP to open-world unlabeled data, offering practical benefits for real-world deployment of vision-language models.

Abstract

The emergence of vision-language models, such as CLIP, has spurred a significant research effort towards their application for downstream supervised learning tasks. Although some previous studies have explored the unsupervised fine-tuning of CLIP, they often rely on prior knowledge in the form of class names associated with ground truth labels. This paper explores a realistic unsupervised fine-tuning scenario, considering the presence of out-of-distribution samples from unknown classes within the unlabeled data. In particular, we focus on simultaneously enhancing out-of-distribution detection and the recognition of instances associated with known classes. To tackle this problem, we present a simple, efficient, and effective approach called Universal Entropy Optimization (UEO). UEO leverages sample-level confidence to approximately minimize the conditional entropy of confident instances and maximize the marginal entropy of less confident instances. Apart from optimizing the textual prompt, UEO incorporates optimization of channel-wise affine transformations within the visual branch of CLIP. Extensive experiments across 15 domains and 4 different types of prior knowledge validate the effectiveness of UEO compared to baseline methods. The code is publicly available at https://github.com/tim-learn/UEO.
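
To make the objective described in the abstract more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It assumes that sample-level confidence is the maximum softmax probability, that the conditional-entropy term is weighted by confidence normalized over the batch, and that the marginal-entropy term is weighted by the normalized reciprocal of confidence (a guess at the "reverse weighting scheme"). The function name `ueo_style_loss` and the helpers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)


def ueo_style_loss(batch_logits):
    """Return (loss, weighted conditional entropy, weighted marginal entropy).

    Assumption: confidence = max softmax probability per sample.
    """
    probs = [softmax(logits) for logits in batch_logits]
    conf = [max(p) for p in probs]

    # Confident samples get larger weights in the conditional-entropy term.
    conf_sum = sum(conf)
    w_conf = [c / conf_sum for c in conf]

    # Assumed reverse weighting: less confident samples get larger weights
    # in the marginal-entropy term.
    inv = [1.0 / c for c in conf]
    inv_sum = sum(inv)
    w_rev = [v / inv_sum for v in inv]

    # Term to minimize: weighted mean of per-sample prediction entropy.
    cond_ent = sum(w * entropy(p) for w, p in zip(w_conf, probs))

    # Term to maximize: entropy of the reverse-weighted mean prediction.
    num_classes = len(probs[0])
    marginal = [
        sum(w * p[j] for w, p in zip(w_rev, probs)) for j in range(num_classes)
    ]
    marg_ent = entropy(marginal)

    return cond_ent - marg_ent, cond_ent, marg_ent
```

Under these assumptions, a batch of confident predictions spread evenly over the classes yields a lower loss than a batch of uniform predictions, which is the behaviour the abstract describes. In actual fine-tuning the same quantities would be computed on differentiable tensors so that gradients reach the textual prompt and the affine parameters.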

Paper Structure (20 sections, 4 equations, 6 figures, 11 tables, 1 algorithm)

This paper contains 20 sections, 4 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: The basic setup of Unsupervised Universal Fine-Tuning (U$^2$-FT). During the training phase, U$^2$-FT fine-tunes the pre-trained CLIP with unlabeled in-the-wild training data according to an imprecise predefined list of class names (where 'fox' may be absent in C1 and 'panda' may be included in C2). In the test phase, a data set independent of the training data, containing both known classes and unknown (OOD) classes, is used to evaluate both generalization and OOD detection.
  • Figure 2: (a-b) OS and HOS scores (Bucci et al., 2020) of different methods as the threshold changes, under open-partial category shift on the Cl domain of OfficeHome (Venkateswara et al., 2017). (c) The relationship between AUC and the maximum HOS score for four different domains.
  • Figure 3: Different optimization designs on (Sk) in DomainNet and (Re) in OfficeHome, (open-partial-set, ResNet-50).
  • Figure 4: Results of different hyperparameters on (Sk) in DomainNet (open-partial-set, ResNet-50).
  • Figure 5: Different optimization designs on (Sk) in DomainNet and (Re) in OfficeHome, (closed-set, ResNet-50).
  • ...and 1 more figure