Contrastive Localized Language-Image Pre-Training

Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan

TL;DR

This work identifies a key limitation of CLIP: global image-text alignment often lacks fine-grained region understanding necessary for referring and grounding tasks in multimodal language models. It introduces Contrastive Localized Language-Image Pre-Training (CLOC), which adds a region-text contrastive objective and a lightweight Prompter to transform image embeddings into region-focused representations, enabling region-level zero-shot tasks. To support large-scale training, it presents the Visually-Enriched and Spatially-Localized (VESL) captioning pipeline, generating high-quality region-caption pairs by re-captioning images and applying open-vocabulary detection, yielding a two-billion image region-text dataset. Empirical results across 31 tasks, including MLLM-driven VQA and grounding benchmarks, show that CLOC consistently outperforms CLIP on region tasks while maintaining image-level performance, and can serve as a drop-in backbone for MLLMs, significantly enhancing fine-grained visual understanding and grounding capabilities.

Abstract

Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at image levels. Nevertheless, such criteria may become insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
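
The region-text contrastive loss mentioned above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE form in the style of CLIP, applied to matched (region, caption) embedding pairs. The function names, the temperature of 0.07, and the equal weighting of the two directions are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over matched (region, caption) pairs.

    region_embs[i] and text_embs[i] are assumed to describe the same region;
    every other pairing in the batch acts as a negative.
    """
    r = [l2_normalize(v) for v in region_embs]
    t = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every region and every caption, scaled.
    logits = [[sum(a * b for a, b in zip(ri, tj)) / temperature for tj in t]
              for ri in r]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-region view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two regions whose captions are either aligned or swapped.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned = region_text_contrastive_loss(regions, [[1.0, 0.0], [0.0, 1.0]])
swapped = region_text_contrastive_loss(regions, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a much lower loss than the swapped one, which is the behavior a contrastive objective is meant to reward.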

Paper Structure (37 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Our CLOC pre-training framework. (1) A visually-enriched and spatially-localized captioning pipeline pseudo-labels bounding boxes with detailed descriptions for key regions. (2) A lightweight Prompter attached to the CLIP image encoder can be prompted to transform the image embedding into region-focused features. All parameters are trained end-to-end from scratch with our contrastive localized language-image loss on the annotated region-text datasets. After pre-training, (3a) region features can be generated via the Prompter for region-text tasks like object classification in a training-free fashion. (3b) The image encoder, along with the optional Prompter, can also strengthen MLLM fine-tuning by enhancing fine-grained image understanding capabilities.
  • Figure 2: CLOC promptable embedding architecture. CLOC builds upon the image embedding from CLIP (before pooling and projection) and transforms it into a region-aware vision embedding given an encoded prompt, e.g., positional encodings of box coordinates or a regional caption encoded by the CLIP text encoder.
  • Figure 3: Our Visually-Enriched and Spatially-Localized (VESL) captioning pipeline. We leverage an existing open-vocabulary detector (e.g., OWLv2) that predicts bounding boxes on the images and assigns labels from the given text phrase candidates. Previous methods often use the alt-text attached to the images, which tends to describe regions insufficiently. We found it crucial to re-caption images with the visually-enriched captioner VeCap [lai2023scarcity] so that the detector can exploit richer visual concepts.
  • Figure A: Examples comparing our VESL with the labeling approach in [minderer2024scaling], which directly uses the $n$-grams of the crawled AltText. For VESL, each image is annotated with a visually-enriched caption that replaces the AltText and is used to generate region text candidates that better capture the image content.
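
To make the idea in Figure 2 concrete, the following sketch shows one simple way a box prompt could turn pre-pooling patch embeddings into a region-focused feature. It is a deliberately simplified stand-in that averages the patches whose centers fall inside the box. The learned Prompter is not specified in this summary, and the function name `box_pool`, the row-major grid layout, and the normalized box format are all assumptions made for illustration.

```python
def box_pool(patch_embs, grid_h, grid_w, box):
    """Average the patch embeddings whose centers fall inside a box.

    patch_embs: list of grid_h * grid_w vectors in row-major order
                (hypothetical layout for the pre-pooling image embedding).
    box:        (x0, y0, x1, y1), normalized to [0, 1] image coordinates.
    """
    x0, y0, x1, y1 = box
    dim = len(patch_embs[0])
    acc = [0.0] * dim
    count = 0
    for row in range(grid_h):
        for col in range(grid_w):
            # Center of this patch in normalized image coordinates.
            cx = (col + 0.5) / grid_w
            cy = (row + 0.5) / grid_h
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                vec = patch_embs[row * grid_w + col]
                acc = [a + v for a, v in zip(acc, vec)]
                count += 1
    if count == 0:
        raise ValueError("box covers no patch centers")
    return [a / count for a in acc]


# Toy 2x2 grid of 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0],
           [1.0, 1.0], [0.0, 0.0]]
left_half = box_pool(patches, 2, 2, (0.0, 0.0, 0.5, 1.0))
full_image = box_pool(patches, 2, 2, (0.0, 0.0, 1.0, 1.0))
```

With a box covering the whole image this reduces to ordinary global average pooling, which shows why one set of patch embeddings can serve both image-level and region-level tasks. A trained module would replace the fixed average with learned parameters.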