Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer

TL;DR

Omni-Embed-Nemotron tackles the challenge of retrieval across text, image, audio, and video by introducing a unified bi-encoder model built on the Qwen-Omni backbone. It uses contrastive learning with hard-negative mining to align a shared embedding space across modalities and supports cross-modal and joint-modal queries. Across video, image, and text benchmarks, the model shows strong open-domain video retrieval, competitive image retrieval, and solid text retrieval, with notable gains from in-domain fine-tuning and modality-aware preprocessing. The work demonstrates the feasibility and value of a single retrieval system capable of leveraging multimodal cues for scalable information access, while also highlighting areas for further improvement in modality-specific alignment and fusion strategies.
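As a rough illustration of what contrastive learning with hard negatives optimizes, the sketch below computes an InfoNCE-style loss for a single query. It is a minimal, pure-Python example and not the authors' training code: the function names, the temperature value of 0.05, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss for one query (temperature is an assumed value).

    The positive is placed at index 0 of the candidate list; the loss is the
    negative log-softmax probability assigned to that positive.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]


# Toy embeddings (hypothetical): one query, its positive, and two kinds of negatives.
query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]   # far from the query
hard_negative = [0.8, 0.2]   # close to the query, so harder to separate

easy_loss = info_nce_loss(query, positive, [easy_negative])
hard_loss = info_nce_loss(query, positive, [hard_negative])
print(easy_loss, hard_loss)
```

In this toy setup the hard negative yields a much larger loss than the easy one, which is the usual motivation for mining hard negatives: easy negatives contribute almost no training signal once the embedding space is roughly aligned.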

Abstract

We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. While Retrieval-Augmented Generation (RAG) has significantly advanced language models by incorporating external knowledge, existing text-based retrievers rely on clean, structured input and struggle with the visually and semantically rich content found in real-world documents such as PDFs, slides, or videos. Recent work such as ColPali has shown that preserving document layout using image-based representations can improve retrieval quality. Building on this, and inspired by the capabilities of recent multimodal models such as Qwen2.5-Omni, we extend retrieval beyond text and images to also support audio and video modalities. Omni-Embed-Nemotron enables both cross-modal (e.g., text → video) and joint-modal (e.g., text → video+audio) retrieval using a single model. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Multimodal Retrieval Architecture with three input modalities: text, image, and video.
  • Figure 2: Bi-encoder retrieval system where both query and corpus are encoded by the same LLM, followed by a pooling layer. The resulting representations are then compared to compute a similarity score.
  • Figure 3: Comparison of two fusion strategies: (a) Qwen Omni model's interleaved fusion strategy, where audio and video tokens are organized sequentially and synchronized using TMRoPE. (b) Our retrieval model's separate-stream fusion strategy, where audio and video are encoded independently without token interleaving.
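The captions for Figures 2 and 3 can be made concrete with a small sketch. The first part mimics the bi-encoder flow of Figure 2 (encode, pool, compare); the second contrasts the two token orderings named in Figure 3. Everything here is a toy illustration under stated assumptions: the per-token vectors stand in for LLM hidden states, mean pooling and cosine similarity are assumed choices (the captions say only "a pooling layer" and "a similarity score"), and the video-before-audio order in both orderings is also an assumption.

```python
import math


def mean_pool(token_vectors):
    """Average per-token vectors into one vector (assumed pooling choice)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def embed(token_vectors):
    """Stand-in for 'shared encoder + pooling': pool, then normalize."""
    return l2_normalize(mean_pool(token_vectors))


def rank(query_tokens, corpus):
    """Score every corpus item against the query with the same embed() function."""
    q = embed(query_tokens)
    scored = []
    for doc_id, doc_tokens in corpus.items():
        d = embed(doc_tokens)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def interleaved_order(video_chunks, audio_chunks):
    """Figure 3(a) style: alternate video and audio tokens chunk by chunk."""
    tokens = []
    for v, a in zip(video_chunks, audio_chunks):
        tokens.extend(v)
        tokens.extend(a)
    return tokens


def separate_stream_order(video_chunks, audio_chunks):
    """Figure 3(b) style: keep each modality as one contiguous stream."""
    video = [t for chunk in video_chunks for t in chunk]
    audio = [t for chunk in audio_chunks for t in chunk]
    return video + audio


# Hypothetical toy data: two corpus items with 2-dimensional token vectors.
corpus = {
    "clip_a": [[1.0, 0.0], [0.8, 0.2]],
    "clip_b": [[0.0, 1.0], [0.1, 0.9]],
}
print(rank([[0.9, 0.1]], corpus))
print(interleaved_order([["v0"], ["v1"]], [["a0"], ["a1"]]))
print(separate_stream_order([["v0"], ["v1"]], [["a0"], ["a1"]]))
```

Because query and corpus go through the same embed() function, corpus embeddings can be computed once offline and reused for every query, which is the main practical appeal of a bi-encoder.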