Table of Contents
Fetching ...

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Yeong-Joon Ju, Seong-Whan Lee

TL;DR

This paper introduces a hierarchical embedding prompt that provides strong latent conditioning, and presents Self-aware Hard Negative Sampling (SaHa), a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space.

Abstract

Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space. We first introduce a hierarchical embedding prompt that provides strong latent conditioning. By explicitly anchoring task definitions at the system level, this prompting strategy effectively bridges the modality gap and unlocks powerful zero-shot embedding capabilities. Building upon this latent conditioning, we present Self-aware Hard Negative Sampling (SaHa). Unlike conventional candidate-space mining, SaHa shifts the mechanism to the query-space by mapping retrieved candidates back to their owner queries to rigorously filter out semantic false negatives. Furthermore, our method constructs mutually hard clusters, maximizing intra-task discrimination and batch efficiency without redundant forward passes. Extensive experiments demonstrate that our unified approach achieves highly competitive fine-tuning performance on the Massive Multimodal Embedding Benchmark using only a fraction of standard training data.

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

TL;DR

This paper introduces a hierarchical embedding prompt that provides strong latent conditioning, and presents Self-aware Hard Negative Sampling (SaHa), a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space.

Abstract

Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space. We first introduce a hierarchical embedding prompt that provides strong latent conditioning. By explicitly anchoring task definitions at the system level, this prompting strategy effectively bridges the modality gap and unlocks powerful zero-shot embedding capabilities. Building upon this latent conditioning, we present Self-aware Hard Negative Sampling (SaHa). Unlike conventional candidate-space mining, SaHa shifts the mechanism to the query-space by mapping retrieved candidates back to their owner queries to rigorously filter out semantic false negatives. Furthermore, our method constructs mutually hard clusters, maximizing intra-task discrimination and batch efficiency without redundant forward passes. Extensive experiments demonstrate that our unified approach achieves highly competitive fine-tuning performance on the Massive Multimodal Embedding Benchmark using only a fraction of standard training data.

Paper Structure

This paper contains 37 sections, 2 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: False negatives in hard negative mining. Since discriminative training relies on annotated pairs, valid descriptions (red) are treated as negatives simply since they are not explicitly paired with the query (green). Our SaHa strategy effectively filters these unpaired positives to prevent contradictory signals.
  • Figure 2: Overview of the Self-aware Hard Negative Sampling (SaHa) strategy. We first mine a broad candidate pool ($m \times k$) for a given anchor query. To prevent false negatives, candidates are mapped to their owner queries. By measuring similarity against the anchor query, we discard high-similarity pairs as potential false negatives and select the $k$ least similar ones as valid hard negatives. The selected samples form mutually hard clusters, optimizing both fine-grained intra-task discrimination and global inter-task separation within a single efficient batch.
  • Figure 3: UMAP visualization of the zero-shot embedding space.
  • Figure 4: Training dynamics and robustness analysis. We compare the Precision@1 scores of the standard user-prompt baseline (w/o SYS) and our hierarchical approach (Hier) across training iterations.
  • Figure 5: Hardness vs. Safety trade-off. Compared to HNS (red), SaHa (green) effectively lowers semantic overlap (y-axis) while preserving task difficulty (x-axis).
  • ...and 4 more figures