LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani

TL;DR

This work proposes LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. The method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images.

Abstract

Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average improvement of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.

Paper Structure

This paper contains 13 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of LatteCLIP. Our prototype-based method leverages different types of pseudo-labels and LMM-synthetic texts for improved unsupervised CLIP fine-tuning on domain-specific datasets (e.g., texture). During inference, image features are compared with prototypes to generate predictions. Here, $f(\cdot)$ and $g(\cdot)$ are the CLIP image and text encoders, respectively.
  • Figure 2: Text Generation with LMM. In addition to the usual class-description (middle), which combines template text with the pseudo-label, we leverage the LMM LLaVA to generate an image-description (top), which provides a more expressive visual description of the image. Further, by considering random groups of images with the same pseudo-label, we prompt LLaVA to capture shared characteristics as a group-description (bottom).
  • Figure 3: Training. For image $x$, we predict pseudo-label $c\in\{c_{\rm zs}, c_{\rm ft}\}$ and create three types of descriptions per pseudo-label, as described in the text generation section. Our Dynamic Feature Mixer combines these descriptions with the corresponding prototype $p_c$ to produce a prototype-text embedding $\bar{t}$, which updates the prototype $p_c$. Lastly, the contrastive loss is computed between $\bar{t}$ and the image embedding $f(x)$.
  • Figure 4: Dynamic Feature Mixer. We compute cosine similarities between each text feature and all prototypes. Weights are determined by the difference between the top two similarity scores. We calculate a weighted average of the features and combine it with the prototype (see the prototype learning section), creating a representation relevant to the input prototype yet distinct from others.
  • Figure 5: Examples of generated captions. We either generate a caption from a group of 4 images, by inputting them as a single tiled image into LLaVA (${T^{\rm group}}$), or we input a single image to LLaVA (${T^{\rm image}}$). For simplicity, in this figure, we only show a single image caption (highlighted by a red bounding box).
  • ...and 2 more figures
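
The Dynamic Feature Mixer described in the Figure 4 caption can be illustrated with a minimal sketch. The caption only states that weights come from the difference between the top two cosine similarities and that the weighted average is combined with the prototype. Everything else here is an assumption made for illustration: the softmax normalisation of the margins, the `momentum` blending coefficient, and the function names `top2_margin` and `mix_features` are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top2_margin(text_feat, prototypes):
    """Gap between the two highest similarities of a text feature to all prototypes.

    A large gap means the text clearly matches one class; a small gap means
    the text is ambiguous between classes.
    """
    sims = sorted((cosine(text_feat, p) for p in prototypes), reverse=True)
    return sims[0] - sims[1]


def mix_features(text_feats, prototypes, class_idx, momentum=0.9):
    """Weight text features by their top-2 margin and blend with one prototype.

    ASSUMPTIONS (not from the source): margins are normalised with a softmax,
    and the blend with the prototype is a convex combination controlled by
    `momentum`.
    """
    margins = [top2_margin(t, prototypes) for t in text_feats]
    exps = [math.exp(m) for m in margins]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the text features, one dimension at a time.
    dim = len(text_feats[0])
    mixed = [sum(w * t[d] for w, t in zip(weights, text_feats)) for d in range(dim)]

    # Blend the averaged text feature with the prototype of the pseudo-label.
    proto = prototypes[class_idx]
    combined = [momentum * p + (1.0 - momentum) * m for p, m in zip(proto, mixed)]
    return weights, combined


# Toy example with two classes in a 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [
    [1.0, 0.0],  # clearly aligned with class 0 (large margin)
    [1.0, 1.0],  # equally similar to both classes (zero margin)
]
weights, combined = mix_features(text_feats, prototypes, class_idx=0)
```

In this toy setup the unambiguous text feature receives the larger weight, which matches the caption's intent of favouring descriptions that are relevant to one prototype yet distinct from the others.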