IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval

Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu

TL;DR

This work defines Instance-Driven Multimodal Image Retrieval (IDMR), a task that requires retrieving an image containing the same object instance as a query image while matching a text-described context. It introduces IDMR-bench, built from real-world object tracking and first-person video data, and a scalable cross-domain data synthesis pipeline that yields 557K training triplets. An MLLM-based retriever trained on 1.2M samples surpasses state-of-the-art baselines on in-domain, zero-shot, and MMEB benchmarks, demonstrating the effectiveness of instance-level, context-aware retrieval and the potential of multimodal LLMs for such tasks. The approach emphasizes data scale, cross-domain synthesis, and efficient fine-tuning (LoRA) to achieve strong generalization across diverse domains. The authors plan to scale further, and the training dataset, code, and models are publicly available. This advances practical instance-aware retrieval for embodied AI and multimedia content pipelines.

Abstract

Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and demonstrate limited practical application value. This inspires us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. To address the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results demonstrate previous models' limitations in instance-aware retrieval and highlight the potential of MLLMs for advanced retrieval applications. The whole training dataset, code, and models, in a wide range of sizes, are available at https://github.com/BwLiu01/IDMR.
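
The cropping-based synthesis mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `Annotation`, `Triplet`, and `make_triplets`, the toy nested-list image format, the caption used as query text, and the area-ratio thresholds are all assumptions made for illustration. The idea shown is only that a box annotation from a detection dataset lets one cut out an instance as the query image, while the full image serves as the retrieval target.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


@dataclass
class Annotation:
    box: Box
    label: str


@dataclass
class Triplet:
    query_image: Image   # crop that shows only the annotated instance
    query_text: str      # text describing the target context (assumed given)
    target_image: Image  # full image that contains the instance
    label: str           # category label carried over from the annotation


def crop(image: Image, box: Box) -> Image:
    """Cut the box region out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def box_area_ratio(image: Image, box: Box) -> float:
    """Fraction of the image area covered by the box."""
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    return ((x1 - x0) * (y1 - y0)) / (width * height)


def make_triplets(
    image: Image,
    annotations: List[Annotation],
    caption: str,
    min_ratio: float = 0.05,  # assumed threshold: drop tiny boxes
    max_ratio: float = 0.5,   # assumed threshold: drop near-full-image boxes
) -> List[Triplet]:
    """Build one (query crop, text, target image) triplet per usable box."""
    triplets = []
    for ann in annotations:
        ratio = box_area_ratio(image, ann.box)
        if not (min_ratio <= ratio <= max_ratio):
            continue  # skip boxes outside the assumed size range
        triplets.append(Triplet(crop(image, ann.box), caption, image, ann.label))
    return triplets


# Toy usage: an 8x8 image with one usable box and two boxes that get filtered.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
annotations = [
    Annotation((2, 2, 6, 6), "cup"),         # covers 25% of the image: kept
    Annotation((0, 0, 1, 1), "speck"),       # too small: filtered
    Annotation((0, 0, 8, 8), "background"),  # too large: filtered
]
triplets = make_triplets(image, annotations, "the cup on a kitchen table")
```

The size filter is included because very small crops carry little instance detail and very large crops make the query nearly identical to the target. The thresholds here are placeholders, not values from the paper.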

Paper Structure

This paper contains 25 sections, 4 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Examples of the Instance-driven Multimodal Image Retrieval task. The positive image should contain the same instance as the query image and also comply with the query text. The negative image adheres to the query text but does not contain the instance present in the query.
  • Figure 2: Comparison of IDMR-bench with other benchmarks. We focus on instance-level retrieval and queries describing complex scenes. Best viewed zoomed in.
  • Figure 3: Data construction pipeline. The training data (top) and zero-shot benchmark data (bottom) are from different sources.
  • Figure 4: Distributions of bounding boxes in the IDMR training dataset before and after filtering.
  • Figure 5: Scaling experiments: (a) left: performance with different amounts of training data and (b) right: performance with different sizes of InternVL2.5.
  • ...and 2 more figures