IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, Chaochao Lu
TL;DR
This work defines Instance-Driven Multimodal Retrieval (IDMR), a task that requires retrieving an image containing the same object instance as a query image while matching a text-described context. It introduces IDMR-bench, built from real-world object tracking and first-person video data, and a scalable cross-domain data synthesis pipeline that yields 557K training triplets. An MLLM-based retriever trained on 1.2M samples surpasses state-of-the-art baselines on in-domain, zero-shot, and MMEB benchmarks, demonstrating the effectiveness of instance-level, context-aware retrieval and the potential of multimodal LLMs for such tasks. The approach relies on data scale, cross-domain synthesis, and efficient fine-tuning (LoRA) to generalize across diverse domains, with plans to scale further and release datasets and models publicly. The work targets practical instance-aware retrieval for embodied AI and multimedia content pipelines.
Abstract
Multimodal retrieval systems are becoming increasingly vital for cutting-edge AI technologies, such as embodied AI and AI-driven digital content industries. However, current multimodal retrieval tasks lack sufficient complexity and offer limited practical application value. This motivates us to design Instance-Driven Multimodal Image Retrieval (IDMR), a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. Unlike existing retrieval tasks focused on global image similarity or category-level matching, IDMR demands fine-grained instance-level consistency across diverse contexts. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. To address the scarcity of training data, we propose a cross-domain synthesis method that creates 557K training samples by cropping objects from standard detection datasets. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench. Experimental results expose the limitations of previous models in instance-aware retrieval and highlight the potential of MLLMs for advanced retrieval applications. The full training dataset, code, and models in a range of sizes are available at https://github.com/BwLiu01/IDMR.
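As a rough illustration of the cropping idea mentioned in the abstract, the sketch below builds one (query crop, text, target image) sample from a single annotated bounding box. It is a toy under stated assumptions, not the authors' pipeline: the `Triplet` layout, the `crop` and `make_triplet` helpers, the nested-list image representation, and the box convention are all hypothetical, and the abstract does not say how the context text is obtained, so it is passed in as a plain string here.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a toy grayscale image stored as rows of pixel values.
Image = List[List[int]]
# Assumption: boxes are (left, top, right, bottom) with right/bottom exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class Triplet:
    """Hypothetical training sample layout, for illustration only."""
    query_crop: Image  # cropped object instance, used as the query image
    text: str          # description of the target context (source unspecified)
    target: Image      # full image that contains the same instance


def crop(image: Image, box: Box) -> Image:
    """Return the pixels inside `box`, clamped to the image bounds."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


def make_triplet(image: Image, box: Box, context_text: str) -> Triplet:
    """Pair a cropped instance with the full image it was taken from."""
    return Triplet(query_crop=crop(image, box), text=context_text, target=image)


# Usage: a 4x4 image whose pixel value encodes its position (row * 4 + col).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sample = make_triplet(image, (1, 1, 3, 3), "the object in its original scene")
```

Here the crop serves as the instance query and the uncropped source image as the retrieval target; a real pipeline would operate on image files and detection annotations rather than nested lists.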
