Hierarchical Matching and Reasoning for Multi-Query Image Retrieval

Zhong Ji, Zhihao Li, Yan Zhang, Haoran Wang, Yanwei Pang, Xuelong Li

TL;DR

This paper tackles Multi-Query Image Retrieval (MQIR) with a three-level hierarchical framework that jointly models fine-grained local alignment between image regions and region-specific queries, context-aware global image-text alignment, and high-level correlations across multiple region-query pairs. The Hierarchical Matching and Reasoning Network (HMRN) comprises a Scalar-based Matching (SM) module for local and global similarities and a Vector-based Reasoning (VR) module for intra- and inter-correlation reasoning, with an ensemble strategy that fuses the three similarity levels. Experiments on Visual Genome show substantial improvements over state-of-the-art methods, including higher R@1 and lower Mean Rank, and extensive ablations validate each component. By exploiting hierarchical structure and high-level correlations, the work improves retrieval in complex multi-query scenarios, with practical implications for fine-grained cross-modal understanding and retrieval systems.

Abstract

As a promising field, Multi-Query Image Retrieval (MQIR) aims to retrieve the semantically relevant image given multiple region-specific text queries. Existing works mainly focus on a single-level similarity between image regions and text queries, which neglects the hierarchical guidance of multi-level similarities and results in incomplete alignments. In addition, the high-level semantic correlations that intrinsically connect different region-query pairs are rarely considered. To address the above limitations, we propose a novel Hierarchical Matching and Reasoning Network (HMRN) for MQIR. It disentangles MQIR into three hierarchical semantic representations, which are responsible for capturing fine-grained local details, contextual global scopes, and high-level inherent correlations. HMRN comprises two modules: a Scalar-based Matching (SM) module and a Vector-based Reasoning (VR) module. Specifically, the SM module characterizes the multi-level alignment similarity, which consists of a fine-grained local-level similarity and a context-aware global-level similarity. The VR module then mines the potential semantic correlations among multiple region-query pairs to obtain the high-level reasoning similarity. Finally, these three-level similarities are aggregated into a joint similarity space to form the ultimate similarity. Extensive experiments on the benchmark dataset demonstrate that our HMRN substantially surpasses the current state-of-the-art methods. For instance, compared with the existing best method, Drill-down, the R@1 metric in the last round is improved by 23.4%. Our source code will be released at https://github.com/LZH-053/HMRN.
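The abstract reports results in terms of R@1, and the TL;DR also mentions Mean Rank. As a rough illustration of how such retrieval metrics are commonly computed from a similarity matrix, here is a minimal NumPy sketch. It is not the authors' evaluation code: the function name `retrieval_metrics`, the convention that the ground-truth image for query set `i` is image `i`, and the optimistic handling of ties are all assumptions made for this example.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K (in percent) and mean rank from a square similarity matrix.

    sim[i, j] is the similarity between query set i and image j.
    Assumption: the ground-truth image for query set i is image i.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Score that each query set assigns to its own ground-truth image.
    gt = sim[np.arange(n), np.arange(n)]
    # Rank = 1 + number of images scored strictly higher (ties resolved optimistically).
    ranks = 1 + (sim > gt[:, None]).sum(axis=1)
    recalls = {f"R@{k}": float((ranks <= k).mean()) * 100 for k in ks}
    return recalls, float(ranks.mean())

if __name__ == "__main__":
    toy = [[0.9, 0.1, 0.2],
           [0.8, 0.3, 0.5],
           [0.1, 0.2, 0.7]]
    print(retrieval_metrics(toy, ks=(1, 2)))
```

In the toy matrix, the first and third query sets rank their ground-truth image first, while the second ranks it third, so R@1 is two out of three and the mean rank is 5/3.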

Paper Structure

This paper contains 35 sections, 21 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Conceptual comparison of traditional ITR and MQIR. (a) ITR performs image retrieval with a single query, which contains coarse-grained global semantic information, while MQIR performs image retrieval with multiple queries, which contain more fine-grained correspondences.
  • Figure 2: Explanation of our structural hierarchy, which disentangles MQIR into three levels. (a) The local level captures the fine-grained alignment between regions and region-specific queries. (b) The global level leverages context information to perform a comprehensive alignment. (c) The high level explores the inherent correlations in terms of intra-correlations and inter-correlations. The former denotes the semantic correlation within each region-query pair and the latter denotes the semantic correlation among different region-query pairs.
  • Figure 3: An overview of the proposed HMRN method. First, images and queries are encoded into feature representations. Then, depending on the form of similarity, two modules are introduced: ① The Scalar-based Matching (SM) module measures the multi-level alignment similarity, using the Local-level Matching approach and the Global-level Matching approach. ② The Vector-based Reasoning (VR) module explores the high-level reasoning similarity: it uses the Intra-Correlation Mining approach to mine the intra-correlation within each region-query pair and the Inter-Correlation Reasoning approach to reason about the inter-correlation among different region-query pairs. Finally, the three-level similarities are aggregated to form the ultimate similarity.
  • Figure 4: An illustration of the inter-correlation graph among region-query pairs, in which the different colored lines denote different types of semantic correlations.
  • Figure 5: Impact of $\lambda$ on HMRN I-T and HMRN T-I. Note that Avg. Recall is plotted against the left vertical axis and Avg. MR against the right vertical axis.
  • ...and 4 more figures
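
The overview in Figure 3 (encode, scalar matching at the local and global levels, vector-based reasoning over region-query pairs, then aggregation) can be pictured with a small NumPy sketch. This is only an illustration under stated assumptions, not the HMRN implementation: the captions do not give the formulas, so the max-over-regions local score, the mean-pooled global score, the squared-difference similarity vectors, the single softmax message-passing step, the projection vector `w`, and the weighted-sum fusion with weight `lam` are all hypothetical stand-ins. Whether `lam` plays the same role as the $\lambda$ in Figure 5 is not stated in this excerpt.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """L2-normalise along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_similarity(regions, queries):
    """Scalar local score: each query takes its best-matching region (assumed rule)."""
    cos = l2norm(queries) @ l2norm(regions).T            # (Q, R) cosine similarities
    return float(cos.max(axis=1).mean())

def global_similarity(regions, queries):
    """Scalar global score: cosine between mean-pooled image and text features (assumed)."""
    img = l2norm(regions.mean(axis=0))
    txt = l2norm(queries.mean(axis=0))
    return float(img @ txt)

def reasoning_similarity(regions, queries, w):
    """Vector-based stand-in for intra-/inter-correlation reasoning (hypothetical)."""
    cos = l2norm(queries) @ l2norm(regions).T            # (Q, R)
    matched = regions[cos.argmax(axis=1)]                # best region per query, (Q, D)
    # Intra-correlation stand-in: one similarity vector per region-query pair.
    vecs = l2norm((matched - queries) ** 2)              # (Q, D)
    # Inter-correlation stand-in: one softmax-weighted message-passing step
    # over a fully connected graph whose nodes are the region-query pairs.
    logits = vecs @ vecs.T                               # (Q, Q)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    reasoned = attn @ vecs                               # (Q, D)
    # Project each reasoned vector to a scalar with w (hypothetical parameter).
    return float(np.tanh(reasoned @ w).mean())

def fuse(s_local, s_global, s_reason, lam=0.5):
    """Assumed aggregation: weighted sum of the three similarity levels."""
    return s_local + s_global + lam * s_reason

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    regions = rng.normal(size=(5, 8))   # 5 region features of dimension 8
    queries = rng.normal(size=(3, 8))   # 3 region-specific query features
    w = rng.normal(size=8)
    score = fuse(local_similarity(regions, queries),
                 global_similarity(regions, queries),
                 reasoning_similarity(regions, queries, w))
    print(score)
```

The point of the sketch is the separation of concerns: the two scalar scores come directly from cosine similarities, while the reasoning score first builds a vector per region-query pair and lets the pairs exchange information before being reduced to a scalar.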