ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning

Pengfei Luo, Jingbo Zhou, Tong Xu, Yuan Xia, Linli Xu, Enhong Chen

TL;DR

ImageScope addresses the fragmentation of language-guided image retrieval (LGIR) with a training-free, three-stage framework that recasts text-to-image retrieval (TIR), composed image retrieval (CIR), and chat-based image retrieval (Chat-IR) as a single generalized text-to-image retrieval process. It converts visual content to language via a captioner, uses a reasoner to synthesize target semantics in the language space, and applies a two-tier reflective process (local predicate verification and global pairwise evaluation) to refine results. Stage 1 decomposes instructions into atomic operations and produces multi-granularity descriptions; Stage 2 verifies the resulting propositions with a relaxation-based scorer; Stage 3 uses a separate evaluator to perform global comparisons with a reference image. Across six LGIR datasets, ImageScope achieves state-of-the-art or competitive results in zero-shot settings. Ablations and efficiency analyses confirm the contribution of each stage and the framework's generality across model choices, which makes it a practical, interpretable option for diverse LGIR tasks.

Abstract

With the proliferation of images in online content, language-guided image retrieval (LGIR) has emerged as a research hotspot over the past decade, encompassing a variety of subtasks with diverse input forms. While the development of large multimodal models (LMMs) has significantly facilitated these tasks, existing approaches often address them in isolation, requiring the construction of separate systems for each task. This not only increases system complexity and maintenance costs, but also exacerbates challenges stemming from language ambiguity and complex image content, making it difficult for retrieval systems to provide accurate and reliable results. To this end, we propose ImageScope, a training-free, three-stage framework that leverages collective reasoning to unify LGIR tasks. The key insight behind the unification lies in the compositional nature of language, which transforms diverse LGIR tasks into a generalized text-to-image retrieval process, along with the reasoning of LMMs serving as a universal verification to refine the results. To be specific, in the first stage, we improve the robustness of the framework by synthesizing search intents across varying levels of semantic granularity using chain-of-thought (CoT) reasoning. In the second and third stages, we then reflect on retrieval results by verifying predicate propositions locally, and performing pairwise evaluations globally. Experiments conducted on six LGIR datasets demonstrate that ImageScope outperforms competitive baselines. Comprehensive evaluations and ablation studies further confirm the effectiveness of our design.
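
The three-stage flow described above (retrieve with queries at several granularities, verify propositions locally, then compare candidates pairwise) can be illustrated with a small toy sketch. This is not the authors' implementation: captions stand in for images, a token-overlap score stands in for a CLIP-style retriever, and the `holds` and `prefer` callbacks stand in for the LMM verifier and evaluator. All names here (`Candidate`, `token_overlap`, `stage1_retrieve`, `stage2_verify`, `stage3_pairwise`, `threshold`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    image_id: str
    caption: str  # text stand-in for the image (hypothetical captioner output)


def token_overlap(query: str, caption: str) -> float:
    """Toy similarity: fraction of query tokens that appear in the caption."""
    q_tokens = set(query.lower().split())
    c_tokens = set(caption.lower().split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


def stage1_retrieve(queries: Sequence[str], pool: Sequence[Candidate], k: int) -> List[Candidate]:
    """Rank the pool by the summed score over queries of different granularity."""
    ranked = sorted(
        pool,
        key=lambda c: sum(token_overlap(q, c.caption) for q in queries),
        reverse=True,
    )
    return ranked[:k]


def stage2_verify(
    candidates: Sequence[Candidate],
    propositions: Sequence[str],
    holds: Callable[[str, Candidate], bool],
    threshold: float,
) -> List[Candidate]:
    """Move candidates satisfying at least `threshold` of the propositions to the front."""

    def satisfied(c: Candidate) -> float:
        if not propositions:
            return 1.0
        return sum(holds(p, c) for p in propositions) / len(propositions)

    passed = [c for c in candidates if satisfied(c) >= threshold]
    failed = [c for c in candidates if satisfied(c) < threshold]
    return passed + failed  # relative order inside each group is preserved


def stage3_pairwise(
    candidates: Sequence[Candidate],
    prefer: Callable[[Candidate, Candidate], bool],
) -> List[Candidate]:
    """Promote the winner of sequential pairwise comparisons to rank 1."""
    if not candidates:
        return []
    best = candidates[0]
    for challenger in candidates[1:]:
        if prefer(challenger, best):  # assumed evaluator: is challenger the better match?
            best = challenger
    return [best] + [c for c in candidates if c is not best]


if __name__ == "__main__":
    pool = [
        Candidate("a", "a brown dog on grass"),
        Candidate("b", "a white dog on a beach"),
        Candidate("c", "a white cat on a sofa"),
    ]
    top = stage1_retrieve(["white dog", "a white dog on a beach"], pool, k=3)
    checked = stage2_verify(
        top, ["white", "beach"], lambda p, c: p in c.caption.split(), threshold=0.5
    )
    final = stage3_pairwise(
        checked,
        lambda x, y: token_overlap("white dog beach", x.caption)
        > token_overlap("white dog beach", y.caption),
    )
    print([c.image_id for c in final])
```

Setting `threshold` below 1.0 is one simple way to mimic a relaxed verification step, so that a single failed proposition does not eliminate a candidate. The paper's actual scorer and evaluator are LMM-based and may differ substantially from this toy version.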

Paper Structure

This paper contains 24 sections, 6 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Illustration of three language-guided image retrieval tasks: Text-to-Image Retrieval (TIR), Composed Image Retrieval (CIR) and Chat-based Image Retrieval (Chat-IR).
  • Figure 2: Illustration of the proposed ImageScope framework.
  • Figure 3: Ablation study of each designed stage on five LGIR datasets. We show the results for two scales of CLIP.
  • Figure 4: Performance of Chat-IR on VisDial (Das et al., 2017) compared with zero-shot CLIP (Jia et al., 2021) and PlugIR (Lee et al., 2024). Complete results are shown in the appendix table labeled table:appendix_ChatIR_VisDial.
  • Figure 5: Inference efficiency analysis. The left figure shows the average inference latency, and the right one shows the overall inference time. Numbers are shown in the appendix tables labeled table:appendix_effiency_latency and table:appendix_effiency_overall.
  • ...and 7 more figures