FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs

Mothilal Asokan, Kebin Wu, Fatima Albreiki

TL;DR

FineLIP extends CLIP to handle long, detailed captions by stretching the text encoder's positional embeddings and introducing an Adaptive Token Refinement Module (ATRM) that aggregates visual and textual tokens into more information-dense representations. It then applies a token-to-token Cross-Modal Late Interaction Module (CLIM) to perform fine-grained cross-modal alignment, trained with a bidirectional triplet margin loss. Evaluations on long-caption zero-shot retrieval and long-text-to-image generation demonstrate state-of-the-art performance across multiple backbones and datasets, with extensive ablations validating each component. The approach also integrates with SDXL for image generation from longer prompts and shows generalization benefits on short-caption tasks, suggesting practical value for vision-language systems that rely on detailed descriptions.
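The token-to-token late interaction and bidirectional triplet margin loss named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the max-similarity pooling (in the style of ColBERT), the hardest-negative mining within the batch, the margin of 0.2, and the function names `late_interaction_score` and `bidirectional_triplet_loss` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def late_interaction_score(img_tokens, txt_tokens):
    """Token-to-token similarity between one image (Nv, D) and one caption (Nt, D).

    Assumption: each token is matched to its most similar token in the other
    modality, and the matches are averaged in both directions.
    """
    sim = l2_normalize(img_tokens) @ l2_normalize(txt_tokens).T  # (Nv, Nt)
    img_to_txt = sim.max(axis=1).mean()  # best caption token per image token
    txt_to_img = sim.max(axis=0).mean()  # best image token per caption token
    return 0.5 * (img_to_txt + txt_to_img)

def bidirectional_triplet_loss(scores, margin=0.2):
    """Hinge loss over a (B, B) score matrix whose diagonal holds matched pairs.

    Assumption: the hardest in-batch negative is used in each direction.
    """
    batch = scores.shape[0]
    pos = np.diag(scores)
    negatives = np.where(np.eye(batch, dtype=bool), -np.inf, scores)
    hardest_caption = negatives.max(axis=1)  # per image
    hardest_image = negatives.max(axis=0)    # per caption
    loss_i2t = np.maximum(0.0, margin + hardest_caption - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_image - pos)
    return float((loss_i2t + loss_t2i).mean())

# Toy batch: each caption reuses half of its image's tokens plus small noise.
rng = np.random.default_rng(0)
images = [rng.normal(size=(16, 32)) for _ in range(4)]
captions = [img[:8] + 0.01 * rng.normal(size=(8, 32)) for img in images]
scores = np.array([[late_interaction_score(v, t) for t in captions] for v in images])
loss = bidirectional_triplet_loss(scores)
```

In this toy batch the matched pairs sit on the diagonal of `scores`, so the loss only penalizes an image or caption whose hardest negative comes within `margin` of its positive score.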

Abstract

As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.
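As a rough illustration of what extending positional embeddings beyond 77 tokens can look like, the sketch below linearly interpolates a learned embedding table to a longer length. The abstract does not say which stretching scheme FineLIP uses, so plain linear interpolation is an assumption here, as are the target length of 248, the embedding width of 512, and the function name `stretch_positional_embeddings`.

```python
import numpy as np

def stretch_positional_embeddings(pos_emb, new_len):
    """Linearly interpolate an (old_len, D) table to (new_len, D).

    The first and last rows are preserved; rows in between are blends of
    the two nearest original positions.
    """
    old_len = pos_emb.shape[0]
    # Fractional source position for every target position.
    src = np.linspace(0.0, old_len - 1, num=new_len)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    w = (src - lo)[:, None]  # blend weight toward the upper neighbour
    return (1.0 - w) * pos_emb[lo] + w * pos_emb[hi]

# Stand-in for a pretrained table: 77 positions, width 512 (both assumed).
old_table = np.random.default_rng(0).normal(size=(77, 512))
new_table = stretch_positional_embeddings(old_table, 248)
```

Interpolation keeps the stretched table close to the pretrained one, which is why such schemes are usually followed by fine-tuning on long captions rather than used as-is.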

Paper Structure

This paper contains 16 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Top-5 text-to-image retrieval results on the Urban1k dataset for L/14 variants of CLIP, Baseline and FineLIP (Ours), with image retrieval scores. Correctly retrieved images are marked with green boxes. CLIP ignores the portion of the caption in bold due to the 77-token limit.
  • Figure 2: Overview of the FineLIP architecture. An image-caption pair $(I,T)$ is passed through the respective encoders $f_v$ and $f_t$ to obtain the embeddings $\mathbb{V}$ and $\mathbb{T}$. The embeddings are then fed to the Adaptive Token Refinement Module, which dynamically aggregates them into sets of representations $\mathbb{V}'$ and $\mathbb{T}'$ that offer improved information density. Finally, these aggregated tokens are forwarded to the Cross-Modal Late Interaction Module to achieve token-to-token fine-grained alignment. Note that the full caption $T$ is shortened for display.
  • Figure 3: Visualization of long text-to-image generations using different L/14 variants. GT means the ground-truth images paired with the captions in image generation. Zoom in for better visualization. Note that the captions used as well as the detailed analysis of these examples are included in the supplementary material.
  • Figure 4: Zero-shot Classification on DataComp-38
  • Figure 5: Top-5 text-to-image retrieval results on the Urban1k dataset for L/14 variants of CLIP, Baseline, SPARC, LAPS and FineLIP (Ours), with retrieval scores. Correctly retrieved images are marked with green boxes. CLIP ignores the portion of the caption in bold due to the 77-token limit.
  • ...and 1 more figure
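The Figure 2 caption describes aggregating encoder tokens into a smaller, denser set before alignment. The internals of the Adaptive Token Refinement Module are not given here, so the following is only a generic soft-pooling sketch under stated assumptions: a hypothetical learned projection `assign_weights` scores every token for each of K output slots, and each slot is a softmax-weighted average of the input tokens.

```python
import numpy as np

def aggregate_tokens(tokens, assign_weights):
    """Pool (N, D) tokens into (K, D) aggregated tokens.

    assign_weights is a hypothetical learned (D, K) projection; it is not
    taken from the paper.
    """
    logits = tokens @ assign_weights                     # (N, K)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)        # softmax over the N tokens
    return attn.T @ tokens                               # (K, D)

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(256, 32))  # e.g. N = 256 visual tokens (assumed)
projection = rng.normal(size=(32, 8))      # K = 8 output slots (assumed)
dense_tokens = aggregate_tokens(patch_tokens, projection)
```

Because each output slot is a convex combination of the inputs, the aggregated tokens stay in the same embedding space as the encoder outputs, so they can be compared across modalities by a late-interaction module.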