
Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

TL;DR

This paper tackles the high computational cost of universal multimodal embedding with MLLMs by introducing Magic-MM-Embedding, a token-efficient architecture that compresses visual tokens via parameter-free interpolation and couples this compression with a three-stage progressive training regime: generative restoration, contrastive pretraining with hard negatives, and task-aware fine-tuning guided by an MLLM-as-a-Judge. A synergistic reranker is trained on judge-curated data to further boost retrieval accuracy. Across natural-image and visual-document benchmarks, the method reports state-of-the-art results while using only a fraction of the visual tokens, indicating that token efficiency and high performance can be achieved together. The approach has practical implications for scalable, low-latency multimodal retrieval in real-world systems.
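
To make "parameter-free interpolation" concrete, the sketch below downsamples a 2D grid of visual token vectors with bilinear interpolation, so no weights are learned. It is only an illustration under stated assumptions: the summary does not say which interpolation mode or grid sizes the paper uses, and the function name `compress_tokens` and the 4 x 4 to 2 x 2 sizes are hypothetical.

```python
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] is one visual token embedding


def compress_tokens(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Downsample an H x W grid of token vectors to out_h x out_w.

    Bilinear interpolation is an assumed choice; it has no trainable parameters.
    """
    in_h, in_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out: Grid = []
    for i in range(out_h):
        # Map the centre of output row i back to input coordinates, clamped to the grid.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row: List[Vector] = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Blend the four neighbouring tokens, one embedding dimension at a time.
            token = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ]
            row.append(token)
        out.append(row)
    return out


# Hypothetical example: 16 one-dimensional tokens compressed to 4 tokens.
tokens = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
compressed = compress_tokens(tokens, 2, 2)
```

In this toy case the language model would see 4 visual tokens instead of 16, which is where the latency and memory savings come from.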

Abstract

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. However, their practical application is often hindered by the substantial computational cost of processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed not only to recover but to significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
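
The role of hard negatives in contrastive pretraining can be illustrated with a standard InfoNCE loss over one query, one positive, and a list of negatives. This is a generic sketch, not the paper's training code: the abstract does not state the exact loss, similarity function, or temperature, so cosine similarity, the temperature of 0.05, and the names `cosine` and `info_nce` are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE loss: negative log-probability of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings: a hard negative lies close to the query.
query = [1.0, 0.0]
positive = [0.9, 0.1]
loss_easy = info_nce(query, positive, [[-1.0, 0.0]])
loss_hard = info_nce(query, positive, [[0.8, 0.2]])
```

Here `loss_hard` is larger than `loss_easy`: an easy negative contributes almost nothing to the loss, while a negative that resembles the positive produces a strong training signal. This is the general motivation for mining hard negatives.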

Paper Structure (17 sections, 6 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: Breaking the efficiency-performance trade-off of MLLM embedders for universal multimodal retrieval. (a) Standard MLLM-based embedders suffer from high computational costs due to processing redundant, dense visual token sequences. (b) We propose a visual token compression model paired with a robust three-stage progressive training strategy. (c) Comparisons on MMEB (VLM2Vec) demonstrate that our approach establishes a new state of the art using far fewer visual tokens and with reduced inference latency.
  • Figure 2: Overview of the proposed visual-token-efficient architecture for universal multimodal retrieval. (a) The proposed MLLM architecture with Visual Token Compression, InternVL3-VTC. (b, c) The proposed inference-efficient, universal multimodal embedder and reranker, both of which are built upon InternVL3-VTC.