SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma, Furong Xu, Jian Liu, Ming Yang, Qingpei Guo
TL;DR
SyCoCa addresses the limitation of unidirectional local interaction in previous vision-language pretraining by introducing bidirectional local interactions through a text-guided masked image modeling (TG-MIM) head and attentive patch masking. It extends CoCa with three objectives—image-text contrastive (ITC), image captioning (IC), and TG-MIM—within a four-component architecture (image encoder, text encoder, image decoder, text decoder) to align vision and language at global and local levels. End-to-end pretraining on CC12M demonstrates consistent gains across image-text retrieval, image captioning, VQA, and zero-shot/fine-tuned image classification, highlighting improved fine-grained cross-modal understanding. The approach emphasizes selective, text-guided interactions to cultivate a unified multimodal latent space with strong transfer to diverse downstream tasks.
Abstract
Multimodal alignment between language and vision is a fundamental topic in current vision-language model research. Contrastive Captioners (CoCa), a representative method, integrates Contrastive Language-Image Pretraining (CLIP) and Image Captioning (IC) into a unified framework and achieves impressive results. CLIP imposes bidirectional constraints on the global representations of entire images and sentences. Although IC performs unidirectional image-to-text generation on local representations, it lacks any constraint on local text-to-image reconstruction, which limits the ability to understand images at a fine-grained level when aligning them with texts. To achieve multimodal alignment from both global and local perspectives, this paper proposes Symmetrizing Contrastive Captioners (SyCoCa), which introduces bidirectional interactions between images and texts at both the global and local representation levels. Specifically, we add a Text-Guided Masked Image Modeling (TG-MIM) head alongside the image-text contrastive (ITC) and IC heads. SyCoCa can thus leverage textual cues to reconstruct contextual image content and visual cues to predict textual content. When implementing bidirectional local interactions, the local contents of images tend to be cluttered or unrelated to their textual descriptions. Thus, we employ an attentive masking strategy to select effective image patches for interaction. Extensive experiments on five vision-language tasks, including image-text retrieval, image captioning, visual question answering, and zero-shot/fine-tuned image classification, validate the effectiveness of the proposed method.
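To make the idea of attentive masking more concrete, the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each image patch is scored by cosine similarity to a single global text embedding and that a fixed fraction of patches is then masked. The function name `attentive_patch_mask`, the `mask_ratio` and `mask_relevant` parameters, and the use of a single text vector are all illustrative assumptions; the abstract does not specify the scoring function or which patches each head masks.

```python
import numpy as np


def attentive_patch_mask(patch_emb, text_emb, mask_ratio=0.5, mask_relevant=True):
    """Return a boolean mask over image patches (True = masked).

    Hypothetical sketch: relevance is the cosine similarity between each
    patch embedding and one global text embedding (an assumption).

    patch_emb: array of shape (num_patches, dim)
    text_emb:  array of shape (dim,)
    mask_relevant: if True, mask the patches most similar to the text;
                   if False, mask the least similar ones.
    """
    # L2-normalize so the dot product equals cosine similarity.
    patches = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    scores = patches @ text  # shape: (num_patches,)

    num_masked = int(round(mask_ratio * len(scores)))
    # Sort descending (most relevant first) or ascending (least relevant first).
    order = np.argsort(-scores if mask_relevant else scores, kind="stable")

    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:num_masked]] = True
    return mask


if __name__ == "__main__":
    # Toy example: four 2-D "patch" embeddings and one "text" embedding.
    patches = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    text = np.array([1.0, 0.0])
    print(attentive_patch_mask(patches, text, mask_ratio=0.5, mask_relevant=True))
    print(attentive_patch_mask(patches, text, mask_ratio=0.5, mask_relevant=False))
```

Under this sketch, the `mask_relevant` flag is what would let the two local heads use opposite selections: hiding text-relevant patches forces a reconstruction head to rely on the text, while hiding irrelevant patches removes clutter for a captioning head. Whether SyCoCa assigns the selections this way is an assumption here.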
