SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment

Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo

TL;DR

SyCoCa addresses the limitation of unidirectional local interaction in previous vision-language pretraining by introducing bidirectional local interactions through a text-guided masked image modeling (TG-MIM) head and attentive patch masking. It extends CoCa with three objectives—image-text contrastive (ITC), image captioning (IC), and TG-MIM—within a four-component architecture (image encoder, text encoder, image decoder, text decoder) to align vision and language at global and local levels. End-to-end pretraining on CC12M demonstrates consistent gains across image-text retrieval, image captioning, VQA, and zero-shot/fine-tuned image classification, highlighting improved fine-grained cross-modal understanding. The approach emphasizes selective, text-guided interactions to cultivate a unified multimodal latent space with strong transfer to diverse downstream tasks.

Abstract

Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), as a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework, achieving impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. Although IC performs unidirectional image-to-text generation on local representations, it places no constraint on local text-to-image reconstruction, which limits the ability to understand images at a fine-grained level when aligned with texts. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions on images and texts across the global and local representation levels. Specifically, we add a Text-Guided Masked Image Modeling (TG-MIM) head alongside the ITC and IC heads. The improved SyCoCa can further leverage textual cues to reconstruct contextual images and visual cues to predict textual contents. When implementing bidirectional local interactions, the local contents of images tend to be cluttered or unrelated to their textual descriptions. Thus, we employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/fine-tuned image classification, validate the effectiveness of our proposed method.
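
As an illustration of the attentive masking idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that patch relevance is scored by cosine similarity between each patch embedding and a single text embedding, and that a fixed fraction of patches is masked. The function names, the `mask_ratio` parameter, and the `mask_relevant` switch are hypothetical, and the abstract does not specify which patches each head masks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attentive_mask(patch_embeds, text_embed, mask_ratio, mask_relevant=True):
    """Return a list of booleans, True where a patch is masked.

    Assumption (not from the paper): relevance is the cosine similarity
    between a patch embedding and the text embedding.
    mask_relevant=True masks the most text-relevant patches;
    mask_relevant=False masks the least relevant ones instead.
    """
    scores = [cosine(p, text_embed) for p in patch_embeds]
    num_masked = int(round(mask_ratio * len(patch_embeds)))
    # Rank patch indices by relevance, most relevant first when mask_relevant.
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=mask_relevant)
    masked = set(order[:num_masked])
    return [i in masked for i in range(len(patch_embeds))]

# Toy example: four patches with 2-d embeddings and one caption embedding.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
mask_hi = attentive_mask(patches, text, mask_ratio=0.5, mask_relevant=True)
mask_lo = attentive_mask(patches, text, mask_ratio=0.5, mask_relevant=False)
print(mask_hi)  # [True, True, False, False]
print(mask_lo)  # [False, False, True, True]
```

Exposing the masking direction as a parameter keeps the sketch neutral: a reconstruction head and a captioning head could plausibly want opposite selections, and the choice is left to the caller here.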

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Comparison of the pipelines in vision-language pretraining frameworks. (a) CLIP: aligning global features across modalities through contrastive learning. (b) CoCa: introducing image captioning to construct unidirectional fine-grained interaction. (c) Our SyCoCa: bidirectional local interaction with attentive masking to enhance comprehensive cross-modal understanding.
  • Figure 2: Detailed illustration of our proposed method. The framework consists of four modules: an image encoder, a (causal) text encoder, a (text-to-)image decoder, and an (image-to-)text decoder. Our method focuses on three pretraining objectives: image-text contrasting (ITC), text-guided masked image modeling (TG-MIM), and image captioning (IC).
  • Figure 3: Qualitative analysis of the proposed SyCoCa. We visualize the attention localization map of the first convolution layer in the image encoder using the Grad-CAM toolkit.
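
The global alignment described for CLIP in Figure 1(a), and named ITC in Figure 2, is commonly implemented as a symmetric contrastive loss over a batch of matched image-text pairs. The sketch below shows that common CLIP-style form in plain Python. It is an assumption that SyCoCa's ITC head matches it exactly, and the function names and the temperature value of 0.07 are illustrative choices, not taken from the paper.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """CLIP-style symmetric contrastive loss.

    Assumes the embeddings are already L2-normalized and that
    image i is paired with text i. The temperature is a hypothetical default.
    """
    n = len(image_embeds)
    # Pairwise similarity logits: rows are images, columns are texts.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in text_embeds] for img in image_embeds]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation over columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax_row(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: correctly paired embeddings versus swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Averaging the two directions is what makes the constraint bidirectional at the global level: both image-to-text and text-to-image retrieval are penalized when a pair is mismatched.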