Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality

Jiahan Chen, Da Li, Hengran Zhang, Yinqiong Cai, Lixin Su, Jiafeng Guo, Daiting Shi, Dawei Yin, Keping Bi

TL;DR

Results validate that content reconstruction is an effective strategy for maximizing the value of existing data: it enables multimodal embedding models to generate compact and informative representations and raises their performance ceiling.

Abstract

Multimodal embedding models, rooted in multimodal large language models (MLLMs), have yielded significant performance improvements across diverse tasks such as retrieval and classification. However, most existing approaches rely heavily on large-scale contrastive learning, with limited exploration of how the architectural and training paradigms of MLLMs affect embedding quality. While effective for generation, the causal attention and next-token prediction paradigm of MLLMs does not explicitly encourage the formation of globally compact representations, limiting their effectiveness as multimodal embedding backbones. To address this, we propose CoCoA, a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. Specifically, we restructure the attention flow and introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. This drives the multimodal model to compress the semantic information of the input into the <EOS> token, laying the foundation for subsequent contrastive learning. Extensive experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality. Results validate that content reconstruction is an effective strategy for maximizing the value of existing data: it enables multimodal embedding models to generate compact and informative representations and raises their performance ceiling.
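
The abstract does not spell out the exact attention layout, so the following is only a minimal sketch of one way an EOS-bridged reconstruction mask could look. It assumes a sequence laid out as input tokens, then a single <EOS> position, then the tokens to be reconstructed. The function name `build_eos_bridged_mask` and the specific visibility rules are hypothetical and are not taken from the paper.

```python
def build_eos_bridged_mask(n_input: int, n_target: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical, for illustration only):
        [n_input input tokens] [EOS] [n_target reconstruction tokens]
    """
    eos = n_input                      # index of the single EOS position
    total = n_input + 1 + n_target
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < eos:
                # Input tokens attend bidirectionally to all input tokens.
                mask[i][j] = j < eos
            elif i == eos:
                # EOS attends to every input token and to itself.
                mask[i][j] = j <= eos
            else:
                # Reconstruction tokens never see the raw input: they attend
                # only to EOS and, causally, to earlier reconstruction tokens.
                mask[i][j] = eos <= j <= i
    return mask


if __name__ == "__main__":
    # Print the mask for 3 input tokens and 2 reconstruction tokens.
    for row in build_eos_bridged_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed layout, the only path from the input to the reconstruction tokens runs through the <EOS> position, which is the sense in which the reconstruction objective would push the input's semantics into that one embedding.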

Paper Structure (20 sections, 4 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Overview of CoCoA. Stage 1: Bidirectional attention warm-up using MAE and MNTP for different modalities; Stage 2: EOS-bridged reconstruction of caption or answer tokens, which forces the multimodal semantics to be compressed into a single token, i.e., $\langle \mathrm{EOS} \rangle$; Stage 3: Contrastive learning using $\langle \mathrm{EOS} \rangle$.
  • Figure 2: An example of synthetic data. The fine-grained details are highlighted in bold to emphasize that these elements are newly introduced by the synthetic data generation process and are not explicitly present in the original data.
  • Figure 3: Performance under different mask ratios. We adopt three different mask ratios: 20%, 50%, and 70% from left to right.
  • Figure 4: Performance under different pre-training data volumes. Baseline is VLM2Vec based on Qwen2-VL-2B. 100K-400K denote the amounts of MMEB-V1 original data, and "+" indicates additional synthetic data.
  • Figure 5: The effect of pre-training data scale on In-Domain (IND) and Out-of-Domain (OOD) performance.
  • ...and 1 more figure
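
Stage 3 in Figure 1 applies contrastive learning to the <EOS> embeddings. The exact loss is not given in this excerpt, so the sketch below assumes a standard InfoNCE objective with in-batch negatives and cosine similarity. The function name `info_nce` and the temperature value of 0.05 are illustrative assumptions, not the paper's settings.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss, treating targets[i] as the positive for queries[i].

    All other targets in the batch act as negatives. The temperature is an
    assumed placeholder value.
    """
    q = [_normalize(v) for v in queries]
    t = [_normalize(v) for v in targets]
    losses = []
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every target, scaled by 1/temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching target.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    eos_queries = [[1.0, 0.0], [0.0, 1.0]]   # toy stand-ins for <EOS> embeddings
    aligned = info_nce(eos_queries, [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce(eos_queries, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In practice the query and target vectors would be the <EOS> hidden states produced by the embedding model for each side of a pair; here they are toy two-dimensional lists so the example runs without any dependencies.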