X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

Sirnam Swetha, Jinyu Yang, Tal Neiman, Mamshad Nayeem Rizve, Son Tran, Benjamin Yao, Trishul Chilimbi, Mubarak Shah

TL;DR

The paper addresses a limitation of CLIP-based vision encoders: they struggle to capture fine-grained visual details for multimodal large language models. It introduces X-Former, a lightweight transformer that fuses features from frozen CLIP-ViT ($C$) and MAE-ViT ($M$) encoders via dual cross-attention, yielding both global and local visual representations. Training has two stages: Stage 1 pre-trains with image-text contrastive (ITC), image-text matching (ITM), image-grounded text generation (ITG), and image reconstruction objectives; Stage 2 aligns the module with an LLM. X-Former achieves strong zero-shot VQA, fine-grained visual perception, and captioning performance, surpassing BLIP-2 while using substantially less data and no instruction tuning. The results indicate that detailed visual features can be learned efficiently, which is of practical value for building data-efficient multimodal LLMs.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have revolutionized the field of vision-language understanding by integrating visual perception capabilities into Large Language Models (LLMs). The prevailing trend in this field involves the utilization of a vision encoder derived from vision-language contrastive learning (CL), showing expertise in capturing overall representations while facing difficulties in capturing detailed local patterns. In this work, we focus on enhancing the visual representations for MLLMs by combining high-frequency and detailed visual representations, obtained through masked image modeling (MIM), with semantically-enriched low-frequency representations captured by CL. To achieve this goal, we introduce X-Former, a lightweight transformer module designed to exploit the complementary strengths of CL and MIM through an innovative interaction mechanism. Specifically, X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders, i.e., CLIP-ViT (CL-based) and MAE-ViT (MIM-based). It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM. To demonstrate the effectiveness of our approach, we assess its performance on tasks demanding detailed visual understanding. Extensive evaluations indicate that X-Former excels in visual reasoning tasks involving both structural and semantic categories in the GQA dataset. Assessment on a fine-grained visual perception benchmark further confirms its superior capabilities in visual understanding.
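To make the contrast between the two pre-training signals concrete, the sketch below shows generic forms of a CL objective (a symmetric InfoNCE loss over paired image/text embeddings) and an MIM objective (mean squared error on randomly masked patches). It is a minimal NumPy illustration, not the paper's implementation: the function names, the temperature of 0.07, and the 0.75 mask ratio are assumptions chosen for the example.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matching pair.

    The temperature value is an illustrative assumption.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B), positives on the diagonal
    idx = np.arange(len(logits))
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask; True marks patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_reconstruction_loss(pred_patches, target_patches, mask):
    """Mean squared error computed only over the masked patches."""
    per_patch = ((pred_patches - target_patches) ** 2).mean(axis=-1)
    return per_patch[mask].mean()

# Toy usage with hypothetical sizes: 4 image/text pairs, 16 patches per image.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
cl_loss = contrastive_loss(embeddings, embeddings)
mask = random_patch_mask(num_patches=16, mask_ratio=0.75, rng=rng)
```

The contrastive loss only compares one pooled embedding per image against one per caption, which is one way to see why CL features tend to be global, whereas the reconstruction loss is evaluated per patch and therefore rewards local detail.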

Paper Structure (39 sections, 14 figures, 14 tables)

Figures (14)

  • Figure 1: (a) Vanilla Q-Former extracts a fixed number of output features $Z'$ from the CLIP image encoder, where $C$ and $z$ denote CLIP-ViT's image features and the query input, respectively; (b) concatenated MAE-ViT $(M^{*})$ and CLIP-ViT $(C)$ features are passed as input to Q-Former; (c) a cross-attention layer is added in L2 to enable MAE-ViT interaction in Q-Former.
  • Figure 2: Performance comparison of BLIP-2, BLIP-2+Concatenation, BLIP-2+Early Cross-Attention, and our method on the VQAv2 (a), GQA (b), and OKVQA (c) datasets.
  • Figure 3: An overview of X-Former, which extends Q-Former with a dual cross-attention module to capture both local and global visual features. First, CLIP visual features $(C)$ and MAE features $(M)$ (with random masking) are computed from the input image-text pair. Q-Former takes $C$, $Z$, and the text as input and generates output queries optimized for three objectives: ITC, ITM, and ITG. The proposed block (purple) enriches the Q-Former global representation $(Z_q)$ with local information from the MAE features $(M)$. Initially, $M$ is aligned with and enriched by $Z_q$, yielding an enriched MAE representation $(M')$ that is optimized for image reconstruction. Then, $M'$ enhances $Z_q$ with local representations through cross-attention, optimized using the VL objectives. Jointly optimizing these four objectives facilitates learning both global and local representations.
  • Figure 4: LLM alignment. X-Former queries are aligned with a frozen decoder-based LLM. An FC layer adapts the query output $(Z^{\prime})$ to the LLM embedding space.
  • Figure 5: Detailed Comparison for both Structural and Semantic categories in GQA.
  • ...and 9 more figures
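
The data flow described in the Figure 3 and Figure 4 captions can be sketched as two chained cross-attention steps followed by a linear projection: $M$ attends to $Z_q$ to give $M'$, $Z_q$ then attends to $M'$, and an FC layer maps the result to the LLM embedding width. This is a simplified NumPy illustration under stated assumptions, not the authors' architecture: it uses single-head attention with residual connections, omits layer norms and feed-forward layers, uses random weights, and assumes $M$ and $Z_q$ share one feature width. The names `CrossAttention`, `dual_cross_attention`, and `project_to_llm`, and all sizes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttention:
    """Single-head cross-attention with a residual connection (illustrative)."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_q = rng.normal(scale=scale, size=(dim, dim))
        self.w_k = rng.normal(scale=scale, size=(dim, dim))
        self.w_v = rng.normal(scale=scale, size=(dim, dim))

    def __call__(self, x, context):
        # Tokens in x form the queries; tokens in context form keys and values.
        q, k, v = x @ self.w_q, context @ self.w_k, context @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (len(x), len(context))
        return x + attn @ v

def dual_cross_attention(z_q, m, attn_m_from_z, attn_z_from_m):
    """Two chained cross-attention steps following the Figure 3 caption."""
    m_prime = attn_m_from_z(m, z_q)         # M is enriched by Z_q, giving M'
    z_prime = attn_z_from_m(z_q, m_prime)   # Z_q is enriched by M'
    return z_prime, m_prime

def project_to_llm(z_prime, w_fc, b_fc):
    """FC layer mapping query outputs to the LLM embedding width (Figure 4)."""
    return z_prime @ w_fc + b_fc

# Toy usage with hypothetical sizes: 4 queries, 10 MAE tokens, width 16.
rng = np.random.default_rng(0)
dim, llm_dim = 16, 32
z_q = rng.normal(size=(4, dim))
m = rng.normal(size=(10, dim))
z_prime, m_prime = dual_cross_attention(
    z_q, m, CrossAttention(dim, rng), CrossAttention(dim, rng)
)
llm_inputs = project_to_llm(z_prime, rng.normal(size=(dim, llm_dim)), np.zeros(llm_dim))
```

In this sketch the number of output queries stays fixed regardless of how many MAE tokens are supplied, which matches the captions' description of queries as a fixed-size interface to the LLM.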