Scaling Language-Centric Omnimodal Representation Learning

Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong

TL;DR

This work investigates why language-centric, generative pretraining of multimodal LLMs yields superior omnimodal embeddings. It shows that latent cross-modal alignment arises during generative training, and that lightweight text-only contrastive refinement can unlock it without destroying the benefits of pretraining, leading to LCO-Emb. The authors formalize a Generation-Representation Scaling Law (GRSL) and provide a PAC-Bayesian bound linking generative quality to an upper bound on representation performance, with experiments on SeaDoc demonstrating gains in low-resource languages. Empirically, LCO-Emb achieves state-of-the-art results on MIEB-Lite with minimal data and performs strongly across vision, audio, and video modalities, supporting the central thesis that generative capability sets an upper bound on representational potential. Overall, the work reframes CL as a lightweight activation mechanism that preserves latent cross-modal alignment, enabling scalable, robust multimodal representations across languages and modalities.

Abstract

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities emerges as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
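
The contrastive refinement stage mentioned in the abstract can be illustrated with a generic in-batch InfoNCE loss. This is a minimal sketch, not the authors' training code: the function names, the temperature of 0.05, and the toy vectors are all assumptions made for illustration, and real training would apply the loss to text embeddings pooled from the MLLM.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for queries[i]; every other row of
    `positives` acts as an in-batch negative. The temperature value is an
    assumed hyperparameter, not one taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

# Toy batch: correctly paired texts versus deliberately mismatched pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
loss_good = info_nce(aligned, aligned)
loss_bad = info_nce(aligned, shuffled)
```

The loss is small when each query is closest to its own positive and large when the pairing is wrong, which is the pressure that pushes text embeddings apart during refinement.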

Paper Structure

This paper contains 30 sections, 2 theorems, 15 equations, 7 figures, 4 tables.

Key Result

Theorem 1

Let $P$ be a generative prior of a pre-trained autoregressive generative model and $Q$ the posterior of the generative model after contrastive fine-tuning on a dataset of $n$ samples. Under the warm-start hypothesis, with probability at least $1-\delta$ over the draw of the training set, the expecte

Figures (7)

  • Figure 1: The anisotropy estimates of Qwen2.5-Omni-3B embeddings across text, image, audio, and video modalities. The vanilla model exhibits typical representation degeneration (anisotropy) for all modalities. After applying text-only contrastive learning, embeddings across modalities become more isotropic, indicating latent language-centric cross-modal alignment within the model.
  • Figure 2: Layer-wise vision-language kernel alignment before and after text-only contrastive learning, evaluated on Qwen-VL models with 7B (28 layers) and 3B (36 layers) parameters. Note the 3B model has more layers than the 7B model.
  • Figure 3: The power of language-centric omnimodal representation learning: Before text-only contrastive learning (CL), representations across modalities in multimodal large language models (MLLMs) exhibit anisotropy, collapsing into a confined subspace. Text-only CL disperses textual representations by increasing their separation, effectively reducing anisotropy. Notably, this process generalizes to alleviate anisotropy in non-textual modalities, despite the absence of direct supervision.
  • Figure 4: Performance comparison of LCO-Emb against state-of-the-art open-source and proprietary embedding models, where we visualize the average performance on MIEB-Lite and its English-only subsets. LCO-Emb-VL and LCO-Emb-Omni denote LCO-Emb trained from the Qwen2.5-VL and Qwen2.5-Omni backbones, respectively, while "T" and "M" represent the text-only and multimodal variants of LCO-Emb, respectively.
  • Figure 5: Ablation comparison of the text-only variants of LCO-Emb with advanced open-source (E5-V [jiang2024e5v]) and proprietary (Voyage Multimodal 3 [voyage3]) embedding models on MIEB-Sub18. LCO-Emb-VL and LCO-Emb-Omni denote LCO-Emb trained from the Qwen2.5-VL and Qwen2.5-Omni backbones, respectively.
  • ...and 2 more figures
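
The two diagnostics named in the captions above, anisotropy and kernel alignment, can be sketched as follows. Both estimators are assumptions chosen for illustration: anisotropy is taken as the average pairwise cosine similarity, and linear CKA stands in for the kernel alignment metric, since the captions do not state the exact formulas the paper uses. The function names and the synthetic data are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def anisotropy(embeddings):
    """Average cosine similarity over all distinct pairs.

    Values near 1 mean the embeddings collapse into a narrow cone;
    values near 0 mean they are spread roughly isotropically.
    """
    n = len(embeddings)
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return sum(sims) / len(sims)

def centered_gram(feats):
    """Gram matrix of mean-centered features (one row per sample)."""
    n, d = len(feats), len(feats[0])
    mean = [sum(row[k] for row in feats) / n for k in range(d)]
    c = [[row[k] - mean[k] for k in range(d)] for row in feats]
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in c] for ci in c]

def frobenius_inner(a, b):
    """Sum of elementwise products of two square matrices."""
    return sum(a[i][j] * b[i][j] for i in range(len(a)) for j in range(len(a)))

def linear_cka(x, y):
    """Linear CKA between two feature sets computed on the same samples."""
    kx, ky = centered_gram(x), centered_gram(y)
    return frobenius_inner(kx, ky) / math.sqrt(
        frobenius_inner(kx, kx) * frobenius_inner(ky, ky))

# Synthetic stand-ins for model embeddings (not real model outputs).
rng = random.Random(0)
isotropic = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]
collapsed = [[x + 10.0 for x in row] for row in isotropic]  # shared offset dominates
unrelated = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]

aniso_iso = anisotropy(isotropic)
aniso_col = anisotropy(collapsed)
cka_same = linear_cka(isotropic, collapsed)
cka_diff = linear_cka(isotropic, unrelated)
```

In this toy setup the shared offset drives anisotropy close to 1 while leaving linear CKA unchanged, because CKA centers the features first. The two diagnostics therefore capture different properties, which is one reason to report both.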

Theorems & Definitions (6)

  • Definition 1: Population and Empirical Risk
  • Definition 2: Generative Quality of the Prior
  • Theorem 1: Generative-Contrastive PAC-Bayes Bound
  • proof
  • Corollary 1: Generative Performance Governs Representation Bound
  • Definition 3: Generative Quality of the Prior
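
For orientation, the classical PAC-Bayes bound that results of this type usually build on can be written as below. This is the textbook McAllester-style bound for a loss bounded in $[0,1]$ and i.i.d. samples, not the paper's Generative-Contrastive bound, whose full statement is truncated on this page. The symbols $R$ and $\hat{R}_n$ for the population and empirical risk are generic notation assumed here.

```latex
% Classical PAC-Bayes bound (background only; not Theorem 1 of the paper).
% With probability at least 1 - \delta over the draw of n i.i.d. samples,
% simultaneously for every posterior Q:
\mathbb{E}_{h \sim Q}\big[R(h)\big]
  \;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{R}_n(h)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

In bounds of this form, the $\mathrm{KL}(Q \,\|\, P)$ term measures how far fine-tuning moves the posterior $Q$ away from the prior $P$, so a prior that already sits close to a good solution leaves a tighter bound.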