Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

Konstantin Schall, Kai Uwe Barthel, Nico Hezel, Klaus Jung

TL;DR

CLIP's joint embeddings excel at text-to-image tasks but struggle with instance-level image retrieval due to semantic-visual gaps. The authors propose two retrieval-focused strategies: a two-stage fine-tuning (2SFT) that first improves image retrieval and then realigns the text encoder, and MCIP, which incorporates pseudo-captions via a Multi-Caption-ArcMargin loss to directly align image and multiple captions in the retrieval phase. Across multiple models and benchmarks, MCIP (often with a subsequent re-alignment) yields strong gains in image retrieval, k-NN, and zero-shot classification while preserving text-to-image retrieval, enabling a single embedding per image for large-scale search systems. The work formalizes the training objectives with $L_{InfoNCE}$, $L_{ArcMargin}$, and $L_{MCArcMargin}$ and provides code and optimized weights for reproducibility.

Abstract

Contrastive Language and Image Pairing (CLIP), a transformative method in multimedia retrieval, typically trains two neural networks concurrently to generate joint embeddings for text and image pairs. However, when applied directly, these models often struggle to differentiate between visually distinct images that have similar captions, resulting in suboptimal performance for image-based similarity searches. This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios, while maintaining their effectiveness in text-based search tasks such as text-to-image retrieval and zero-shot classification. We propose and evaluate two novel methods aimed at refining the retrieval capabilities of CLIP without compromising the alignment between text and image embeddings. The first method involves a sequential fine-tuning process: initially optimizing the image encoder for more precise image retrieval and subsequently realigning the text encoder to these optimized image embeddings. The second approach integrates pseudo-captions during the retrieval-optimization phase to foster direct alignment within the embedding space. Through comprehensive experiments, we demonstrate that these methods enhance CLIP's performance on various benchmarks, including image retrieval, k-NN classification, and zero-shot text-based classification, while maintaining robustness in text-to-image retrieval. Our optimized models permit maintaining a single embedding per image, significantly simplifying the infrastructure needed for large-scale multi-modal similarity search systems.
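
To make the re-alignment idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss of the kind commonly used to train CLIP models. It is not the authors' code: the function names, the NumPy implementation, and the temperature value of 0.07 are assumptions chosen for illustration, and the exact form of $L_{InfoNCE}$ used in the paper may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    Row i of image_emb and row i of text_emb are assumed to form a matching
    pair; all other rows in the batch act as negatives.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

In a re-alignment setting such as the one described above, the image embeddings would come from the locked, retrieval-optimized image encoder, and only the parts of the model producing `text_emb` (and any trainable projector) would receive gradients. The sketch computes the loss value only and leaves the training loop to a deep learning framework.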

Paper Structure (13 sections, 5 equations, 5 figures, 1 table)

This paper contains 13 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: General-Purpose-Retrieval Fine-Tuning: The image projector is discarded and the image encoder is solely fine-tuned with the ArcMargin loss on a dataset consisting of 22.6 million images from 168 thousand classes, with one class per image.
  • Figure 2: Re-Alignment Fine-Tuning: The fine-tuned image encoder is locked and the other three parts of the model are trained to re-align the text embeddings with the image embeddings. A paired image-caption dataset is used in this step.
  • Figure 3: Generated pseudo-captions for three images from one example class of the GPR-FT training set. Even though the texts are redundant, the combination of several captions succeeds in describing the content of the images.
  • Figure 4: Multi-Caption Image Pairing Fine-Tuning: This method uses the generated pseudo-captions to train the image encoder and projector while keeping the text encoder and projector locked, which maintains the alignment of the joint-embedding space in a single fine-tuning run. In addition to the single-label training with the ArcMargin loss, the introduced Multi-Caption-ArcMargin loss is used in this step.
  • Figure 5: Highlighted results for three different models fine-tuned with our proposed MCIP approach. In all three cases, image-based similarity search was significantly improved while text-to-image retrieval quality was maintained and zero-shot classification capabilities were additionally improved.
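
The ArcMargin loss referenced in Figures 1 and 4 can be illustrated with a minimal sketch of an additive angular margin loss in the ArcFace style. This is an assumption about the general family of loss, not the authors' implementation: the name `class_weights`, the `scale` and `margin` defaults, and the NumPy formulation are hypothetical, and the sketch does not attempt to reproduce the Multi-Caption-ArcMargin variant.

```python
import numpy as np

def arc_margin_loss(embeddings, class_weights, labels, scale=30.0, margin=0.5):
    """Additive angular margin loss (ArcFace-style) for single-label training.

    embeddings:    (batch, dim) image embeddings
    class_weights: (num_classes, dim) one learnable prototype per class
    labels:        (batch,) integer class index per image
    scale, margin: illustrative defaults, not values taken from the paper
    """
    labels = np.asarray(labels)
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)

    # Cosine of the angle between every embedding and every class prototype.
    cos = np.clip(emb @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)

    # Add the margin only to the angle of the ground-truth class.
    # Simplification: no special handling for theta + margin > pi.
    rows = np.arange(len(labels))
    logits = cos.copy()
    logits[rows, labels] = np.cos(theta[rows, labels] + margin)
    logits *= scale

    # Standard softmax cross-entropy on the margin-adjusted logits.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```

The margin makes the target class harder to satisfy than plain softmax would, which pushes embeddings of the same class closer together in angle. That property is what makes this family of losses a natural fit for retrieval-oriented fine-tuning with one class per image.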