TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

Yebo Wu; Feng Liu; Ziwei Xie; Zhiyuan Liu; Changwang Zhang; Jun Wang; Li Li

TL;DR

TSEmbed is a universal multimodal embedding framework that combines Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. It also introduces Expert-Aware Negative Sampling (EANS), a strategy that uses expert routing distributions as an intrinsic proxy for semantic similarity.

Abstract

Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
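To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a generic MoE-LoRA layer (a frozen weight plus a softmax-gated sum of low-rank expert updates) and one possible reading of EANS in which in-batch negatives are up-weighted by the cosine similarity of their routing distributions to the query's. The function names, the gating form, the weighting rule `1 + alpha * similarity`, and the parameters `tau` and `alpha` are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_lora_forward(x, W, A, B, W_gate):
    """Generic MoE-LoRA layer (assumed form, not taken from the paper).

    x: (n, d_in), W: (d_in, d_out) frozen base weight,
    A: (E, d_in, r) and B: (E, r, d_out) per-expert low-rank factors,
    W_gate: (d_in, E) router weights.
    Returns the layer output and the routing distribution over experts.
    """
    gate = softmax(x @ W_gate)                     # (n, E), rows sum to 1
    low = np.einsum("nd,edr->ner", x, A)           # project into each expert's rank-r space
    delta = np.einsum("ner,ero->neo", low, B)      # per-expert low-rank update
    out = x @ W + np.einsum("ne,neo->no", gate, delta)
    return out, gate

def routing_similarity(gate_q, gate_c):
    """Cosine similarity between routing distributions: (n_q, E), (n_c, E) -> (n_q, n_c)."""
    q = gate_q / np.linalg.norm(gate_q, axis=1, keepdims=True)
    c = gate_c / np.linalg.norm(gate_c, axis=1, keepdims=True)
    return q @ c.T

def expert_aware_info_nce(q_emb, c_emb, gate_q, gate_c, tau=0.05, alpha=1.0):
    """In-batch contrastive loss with routing-aware negative weights (hypothetical rule).

    Positives sit on the diagonal, so q_emb and c_emb need the same batch size.
    Each negative's term in the denominator is scaled by
    1 + alpha * routing_similarity; alpha = 0 recovers plain InfoNCE.
    """
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = c_emb / np.linalg.norm(c_emb, axis=1, keepdims=True)
    logits = (q @ c.T) / tau
    weights = 1.0 + alpha * routing_similarity(gate_q, gate_c)
    np.fill_diagonal(weights, 1.0)                 # leave the positive pair unweighted
    weighted = logits + np.log(weights)            # same as multiplying exp(logit) by the weight
    weighted = weighted - weighted.max(axis=1, keepdims=True)
    log_prob = weighted - np.log(np.exp(weighted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

In this sketch, negatives routed to the same experts as the query contribute more to the denominator and therefore receive larger gradients, which is one way to read "prioritizing hard negatives that share expert activation patterns." Because the weights depend on the router's output, they are only meaningful once routing has stabilized, which is consistent with the two-stage schedule the abstract describes.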

Paper Structure

This paper contains 20 sections, 10 equations, 9 figures, 3 tables.

Introduction
Anatomy of Task Conflict in Multimodal Embeddings
Spatial Dimension: Divergent Gradient Trajectories
Temporal Dimension: Heterogeneous Convergence Dynamics
Ecological Dimension: Data Imbalance and Task Dominance
TSEmbed: Task Scaling Multimodal Embeddings
Preliminary
Conflict Decoupling: MoE-LoRA
Boundary Refinement: Expert-Aware Negative Sampling
Two-Stage Learning Paradigm
Experiments
Experimental Setup
Baselines
Main Results
Generalization and Efficiency Analysis
...and 5 more sections

Figures (9)

Figure 1: Impact of task conflict on model performance. Red annotations ($\downarrow$) indicate the performance drop when switching from task-specific models to the unified VLM2VEC.
Figure 2: A Multidimensional Anatomy of Task Conflict in Monolithic Adapters. (a) Divergent optimization trajectories of isolated task-specific adapters. (b) Heterogeneous convergence dynamics of individual tasks during training. (c) Layer-wise cosine similarity between the jointly trained adapter and its isolated counterparts.
Figure 2: Performance evaluation on proprietary production datasets sourced from a large-scale technology enterprise.
Figure 3: Overview of existing work and our proposed TSEmbed.
Figure 4: Efficiency analysis of TSEmbed.
...and 4 more figures
