MulCLIP: A Multi-level Alignment Framework for Enhancing Fine-grained Long-context CLIP

Chau Truong, Hieu Ta Quang, Dung D. Le

TL;DR

MulCLIP introduces a region-proposal-free, multi-level alignment framework that extends CLIP to long-form text by jointly optimizing global, local, and subcaption-level alignments. It incorporates Local Token Calibration, Token Reconstruction Alignment, and Subcaption-Aggregated Patch mechanisms to achieve fine-grained cross-modal grounding while maintaining strong global consistency. Comprehensive experiments show consistent improvements in long-caption retrieval and robust zero-shot transfer across diverse datasets, with ablations highlighting the complementary value of each component. The approach offers practical benefits for real-world vision-language tasks requiring detailed long-text understanding without heavy region proposals.

Abstract

Vision-language models like CLIP show an impressive ability to align images and text, but their training on short, concise captions makes them struggle with lengthy, detailed descriptions. Recent advances mitigate this challenge by leveraging region-proposal information to map visual regions to the corresponding sentences of lengthy captions, but this incurs notable deployment costs. We introduce MulCLIP, a novel end-to-end multi-level alignment framework that bridges natural long-text structures with image components. MulCLIP first preserves global contrastive alignment between images and both summary and long captions, while extending positional embeddings for longer text sequences. To further enhance fine-grained understanding, we propose two novel strategies: (1) a token reconstruction alignment over locally calibrated features to strengthen semantic connections between words and image patches, and (2) a subcaption-aggregated patch alignment that automatically extracts and aggregates context-rich patches for each subcaption. Experimental results across diverse benchmarks demonstrate that our method consistently improves downstream performance, while ablation studies confirm that its multi-scale alignment is the key factor driving better fine-grained capability than region-proposal-assisted approaches, making it particularly suitable for diverse real-world applications.
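The global alignment step described in the abstract can be illustrated with a minimal sketch. It assumes a standard symmetric CLIP-style contrastive (InfoNCE) loss applied once to image/summary-caption pairs and once to image/long-caption pairs, with the two terms summed with equal weight. The equal weighting, the temperature value, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: average of image-to-text and text-to-image cross-entropy."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_transposed)
    )


def global_alignment_loss(image_embs, summary_embs, long_embs, temperature=0.07):
    """Hypothetical combination: equal-weight sum over both caption types."""
    return symmetric_contrastive_loss(
        image_embs, summary_embs, temperature
    ) + symmetric_contrastive_loss(image_embs, long_embs, temperature)
```

In this sketch, a batch whose image and caption embeddings are correctly paired yields a lower loss than the same batch with captions shuffled, which is the behaviour the global objective is meant to reward.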

Paper Structure

This paper contains 34 sections, 17 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Multi-granularity textual supervision and benchmark evolution. Using a shared visual concept, we illustrate four levels of text aligned with common benchmarks: (a) an ImageNet-style class label, (b) FG-OVD attribute phrases (✓ positive, ✗ hard negatives), (c) a COCO-style sentence, and (d) a DOCCI-like long description. These granularities reflect the shift from coarse labels to rich, fine-grained descriptions, providing stronger supervision for vision–language alignment.
  • Figure 2: Overview of MulCLIP. An image encoder (ViT) produces a global image embedding $\mathrm{CLS}_{img}$ and a sequence of local tokens $\mathrm{LOC}_{img}$. The text encoder outputs local tokens $\mathrm{LOC}_{text}$ and an end-of-text global embedding $\mathrm{EOT}_{text}$ for multiple textual inputs, including long captions, summary captions, and subcaptions. Independent calibration modules refine and shorten the local sequences of the image and the long text into $v^{\prime}$ and $t^{\prime}$. MulCLIP further exploits these semantic tokens through token reconstruction and the subcaption-aggregated patch mechanism.
  • Figure 3: Qualitative comparison of attention maps. From left to right, we show: (1) the original image, (2) the attention heatmap, and (3) the overlay of the heatmap on the image. Across diverse scenes, MulCLIP produces sharper and more semantically aligned attention, successfully localizing fine-grained details that are often missed or diluted in baseline methods. Red circles highlight regions where MulCLIP demonstrates effective attention localization.
  • Figure 4: Effect of maximum number of sentences on long-text retrieval (DOCCI / DCI / Urban1K).
  • Figure 5: Effect of maximum number of sentences on short-text retrieval (Flickr30K / COCO).
  • ...and 3 more figures
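
Figures 4 and 5 vary the maximum number of sentences taken from a long caption. As a rough illustration of what that knob controls, the sketch below splits a long caption into sentence-level subcaptions and keeps at most `max_sentences` of them. The punctuation-based splitting rule, the choice to keep the first sentences rather than sample them, and the names `split_into_subcaptions` and `max_sentences` are all assumptions for illustration; the source does not describe how MulCLIP segments captions.

```python
import re


def split_into_subcaptions(long_caption, max_sentences=None):
    """Split a long caption into sentences and optionally cap their number.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Assumption: when capped, the first `max_sentences` sentences are kept.
    """
    parts = re.split(r"(?<=[.!?])\s+", long_caption.strip())
    sentences = [part for part in parts if part]  # drop empty fragments
    if max_sentences is not None:
        sentences = sentences[:max_sentences]
    return sentences


caption = (
    "A red bus is parked by the curb. Two people wait nearby! "
    "Is it raining? The sky is grey."
)
subcaptions = split_into_subcaptions(caption, max_sentences=2)
```

Each returned subcaption would then be encoded separately by the text encoder, alongside the full long caption and the summary caption shown in Figure 2.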