CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension

Lihao Liu, Yan Wang, Biao Yang, Da Li, Jiangxia Cao, Yuxiao Luo, Xiang Chen, Xiangyu Wu, Wei Yuan, Fan Yang, Guiguang Ding, Tingting Gao, Guorui Zhou

TL;DR

This work proposes CREM (Compression-driven Representation Enhanced Model), a unified framework that enhances multimodal representations for retrieval while preserving generative ability. It also shows that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains challenging due to the discrepancy between output formats and optimization objectives. Previous approaches often employ contrastive fine-tuning to adapt MLLMs for retrieval, but at the cost of losing their generative capabilities. We argue that both generative and embedding tasks fundamentally rely on shared cognitive mechanisms, specifically cross-modal representation alignment and contextual comprehension. To this end, we propose CREM (Compression-driven Representation Enhanced Model), a unified framework that enhances multimodal representations for retrieval while preserving generative ability. Specifically, we introduce a compression-based prompt design with learnable chorus tokens to aggregate multimodal semantics, and a compression-driven training strategy that integrates contrastive and generative objectives through compression-aware attention. Extensive experiments demonstrate that CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks. Our findings highlight that generative supervision can further improve the representational quality of MLLMs under the proposed compression-driven paradigm.
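
The abstract says the training strategy integrates contrastive and generative objectives. As a rough illustration of what such a combination can look like, the sketch below adds an InfoNCE-style contrastive loss over paired embeddings to a mean next-token negative log-likelihood. The function names, the temperature value, the `gen_weight` parameter, and the weighted-sum form are all assumptions made for illustration; the abstract does not specify CREM's actual loss.

```python
import math

def info_nce(query_embs, target_embs, temperature=0.05):
    """Mean InfoNCE loss; query i is paired with target i, other targets are negatives."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    q = [normalize(v) for v in query_embs]
    t = [normalize(v) for v in target_embs]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
        total += log_z - logits[i]
    return total / len(q)

def next_token_nll(token_log_probs):
    """Mean negative log-likelihood of the reference answer tokens."""
    return -sum(token_log_probs) / len(token_log_probs)

def joint_loss(query_embs, target_embs, token_log_probs, gen_weight=1.0):
    """Assumed combination: contrastive loss plus a weighted generative loss."""
    return info_nce(query_embs, target_embs) + gen_weight * next_token_nll(token_log_probs)
```

Setting `gen_weight=0.0` reduces this sketch to purely contrastive fine-tuning, the setting the abstract says comes at the cost of generative capability.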

Paper Structure

This paper contains 32 sections, 5 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Comparison of Different Paradigms. (a) Embedding models fall short on generation tasks. (b) Generative models lack retrieval capability. (c) Our proposed model CREM enables both in a single model.
  • Figure 2: Compression-Driven Training Framework of CREM. (a) The training pipeline integrates chorus tokens with contrastive and generative objectives under a unified prompt design equipped with compression-aware attention. Generation instructions and answers originate from diverse data sources. (b) Compression-aware attention mask enforcing token-level visibility constraints, where “+” indicates visible tokens and “–” indicates masked ones. (c) Two mixing strategies for generation training using different data sources. Homogeneous data are pseudo-labeled by an MLLM from retrieval pairs, whereas heterogeneous data are collected from open-source datasets.
  • Figure 3: CREM Inference Modes. (a) Retrieval embeddings are derived from pooled chorus tokens. (b) Native next-token prediction is performed with full access to all vision tokens (Nat.). (c) Efficient inference is achieved by pruning vision tokens and reducing KV caches (Comp.).
  • Figure 4: Visualization of Chorus Token Attention. We visualize the attention weights from chorus tokens to vision tokens. Each chorus token is assigned a unique color, and each vision token is colored based on its most attended chorus token, with color intensity reflecting attention strength.
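
The captions for Figures 2(b) and 3(a) mention a compression-aware attention mask and retrieval embeddings pooled from chorus tokens. The sketch below shows one way such a mask and pooling step could be built. It assumes a sequence layout of [vision | chorus | text], causal attention, text positions blocked from seeing vision tokens directly, and mean pooling. None of these details are confirmed by the captions, and all function names are hypothetical.

```python
def compression_aware_mask(n_vision, n_chorus, n_text):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: [vision tokens | chorus tokens | text tokens].
    """
    n = n_vision + n_chorus + n_text
    text_start = n_vision + n_chorus
    # Start from an ordinary causal mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Assumption: text tokens reach visual content only through chorus tokens.
    for i in range(text_start, n):
        for j in range(n_vision):
            mask[i][j] = False
    return mask

def render_mask(mask):
    """Render the mask with '+' for visible and '-' for masked, as in Figure 2(b)."""
    return ["".join("+" if visible else "-" for visible in row) for row in mask]

def pool_chorus(hidden_states, n_vision, n_chorus):
    """Mean-pool the chorus-token hidden states into one embedding (pooling type assumed)."""
    chorus = hidden_states[n_vision:n_vision + n_chorus]
    dim = len(chorus[0])
    return [sum(h[d] for h in chorus) / n_chorus for d in range(dim)]
```

Under these assumptions, no text row attends to a vision column, which is consistent with the Figure 3(c) description of pruning vision tokens and reducing KV caches at inference time.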