UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen

TL;DR

UniIR introduces a universal multimodal information retriever trained with instruction tuning to handle eight retrieval tasks across modalities. It builds M-BEIR, a large-scale benchmark of 10 datasets across 8 tasks with instruction-based queries and a 5.6M candidate pool, enabling standardized evaluation of cross-modal retrieval. The study demonstrates that instruction tuning and multi-task training substantially boost generalization to unseen tasks and held-out datasets, while architecture alignment with pre-training further enhances performance. Together, these elements establish a strong baseline for universal multimodal retrieval and highlight directions for scaling vision-language pretraining to broader retrieval tasks.

Abstract

Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To address such varied information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are key to UniIR's generalization ability. Additionally, we construct M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.

Paper Structure

This paper contains 57 sections, 3 equations, 18 figures, and 13 tables.

Figures (18)

  • Figure 1: We build a universal multimodal information retriever UniIR through instruction tuning. UniIR is capable of accepting any form of query and instruction to retrieve information in any modality.
  • Figure 2: (a) Score-level fusion encodes each modality into a single feature; (b) CLIP feature-level fusion (CLIP$_{\text{FF}}$) fuses two modalities into a single feature with a mix-modality transformer layer; (c) BLIP feature-level fusion (BLIP$_{\text{FF}}$) adopts cross-attention to output a single feature vector.
  • Figure 3: Examples of six query instances in the M-BEIR dataset. Each example query instance includes a query $\mathbf{q}$, a human-annotated natural language instruction $q_{\text{inst}}$, and a positive (relevant) candidate $\mathbf{c}^+$.
  • Figure 4: Visualization of top 5 retrieved candidates from M-BEIR with 3 models on EDIS. Without instructions, zero-shot and multi-task training models mostly retrieve the wrong modality (text-only). UniIR retrieves candidates accurately with the right modality (image, text).
  • Figure 5: Held-out dataset generalization experiments on M-BEIR: we train a Multi-task and a UniIR model on 7 held-in datasets and test on 3 held-out datasets (WebQA, OVEN, CIRR) from the M-BEIR. Results are averaged over CLIP$_{\text{SF}}$ and BLIP$_{\text{FF}}$.
  • ...and 13 more figures
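
The score-level fusion described in the Figure 2(a) caption can be illustrated with a minimal sketch. It assumes that the two per-modality feature vectors are combined by a weighted sum and that relevance is a dot product; under that assumption, the fused score equals a weighted sum of the four within-modality and cross-modality similarity scores. The captions above do not spell out these details, so the function names (`fuse`, `rank`), the weights `w_image` and `w_text`, and the toy two-dimensional vectors are all hypothetical stand-ins rather than the authors' implementation. In a real system the vectors would come from image and text encoders, with the instruction $q_{\text{inst}}$ supplied alongside the text of the query.

```python
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def fuse(
    image_vec: Optional[Vector],
    text_vec: Optional[Vector],
    w_image: float = 1.0,  # hypothetical fusion weight, not from the paper
    w_text: float = 1.0,   # hypothetical fusion weight, not from the paper
) -> List[float]:
    """Weighted sum of per-modality features.

    A missing modality (None) contributes a zero vector, so text-only,
    image-only, and image+text items all map into the same space.
    """
    if image_vec is None and text_vec is None:
        raise ValueError("at least one modality is required")
    dim = len(image_vec if image_vec is not None else text_vec)
    img = image_vec if image_vec is not None else [0.0] * dim
    txt = text_vec if text_vec is not None else [0.0] * dim
    return [w_image * i + w_text * t for i, t in zip(img, txt)]


def rank(query_vec: Vector, candidates: Dict[str, Vector]) -> List[str]:
    """Return candidate ids sorted by descending similarity to the query."""
    return sorted(
        candidates,
        key=lambda cid: dot(query_vec, candidates[cid]),
        reverse=True,
    )


# Toy features standing in for encoder outputs (assumed values).
query = fuse(image_vec=[1.0, 0.0], text_vec=[0.0, 1.0])
pool = {
    "image_only": fuse([1.0, 0.0], None),
    "text_only": fuse(None, [0.0, 1.0]),
    "image_text": fuse([1.0, 0.0], [0.0, 1.0]),
}
ranking = rank(query, pool)
```

Because every candidate lands in one shared vector space regardless of its modality, a single nearest-neighbor search over a heterogeneous pool is possible. The same property means nothing in the scoring itself tells the retriever which modality the user wants, which is the gap the instruction is meant to fill.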