Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs

Yike Wu, Necva Bolucu, Stephen Wan, Dadong Wang, Jiahao Xia, Jian Zhang

Abstract

Zero-shot referring expression comprehension (REC) aims to locate target objects in images given natural language queries, without relying on task-specific training data, and therefore demands strong visual understanding capabilities. Existing Vision-Language Models (VLMs), such as CLIP, commonly address zero-shot REC by directly measuring feature similarities between textual queries and image regions. However, these methods struggle to capture fine-grained visual details and to understand complex object relationships. Meanwhile, although Large Language Models (LLMs) excel at high-level semantic reasoning, their inability to abstract visual features directly into textual semantics limits their application to REC tasks. To overcome these limitations, we propose SGREC, an interpretable zero-shot REC method that leverages query-driven scene graphs as structured intermediaries. Specifically, we first employ a VLM to construct a query-driven scene graph that explicitly encodes the spatial relationships, descriptive captions, and object interactions relevant to the given query. This scene graph bridges the gap between low-level image regions and the higher-level semantic understanding required by LLMs. Finally, an LLM infers the target object from the structured textual representation provided by the scene graph, and responds with detailed explanations for its decisions, ensuring interpretability throughout the inference process. Extensive experiments show that SGREC achieves the highest top-1 accuracy on most zero-shot REC benchmarks, including RefCOCO val (66.78%), RefCOCO+ testB (53.43%), and RefCOCOg val (73.28%), highlighting its strong visual scene understanding.
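
To make the "structured intermediary" concrete: the abstract describes a scene graph whose nodes carry per-object information and whose edges carry interactions, serialized into text for the LLM. The sketch below shows one plausible representation; all class, field, and function names here are assumptions for exposition, not the authors' implementation.

    from dataclasses import dataclass, field

    @dataclass
    class ObjectNode:
        """One query-related object: index, class label, box, and a short caption."""
        index: int
        label: str                                # e.g. "person"
        bbox: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels
        caption: str                              # e.g. "a man in a red jacket"

    @dataclass
    class SceneGraph:
        """Query-driven scene graph: nodes plus (subject, relation, object) triplets."""
        nodes: list[ObjectNode] = field(default_factory=list)
        triplets: list[tuple[int, str, int]] = field(default_factory=list)

        def to_prompt(self) -> str:
            """Serialize the graph into plain text an LLM can reason over."""
            lines = [
                f"[{n.index}] {n.label} at {tuple(round(v) for v in n.bbox)}: {n.caption}"
                for n in self.nodes
            ]
            lines += [
                f"({s}) {self.nodes[s].label} -- {rel} --> ({o}) {self.nodes[o].label}"
                for s, rel, o in self.triplets
            ]
            return "\n".join(lines)

    # Example: two people, one holding an umbrella over the other.
    g = SceneGraph(
        nodes=[
            ObjectNode(0, "person", (40, 60, 180, 420), "a man in a red jacket"),
            ObjectNode(1, "person", (200, 70, 330, 430), "a woman holding an umbrella"),
        ],
        triplets=[(1, "holds umbrella over", 0)],
    )
    print(g.to_prompt())

Serializing nodes with explicit indices lets the LLM answer with a single index, which matches how the pipeline in Figure 3 reports its final decision.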

Paper Structure

This paper contains 18 sections, 1 equation, 10 figures, and 12 tables.

Figures (10)

  • Figure 1: Problem definition of zero-shot referring expression comprehension.
  • Figure 2: Taking into account the spatial information, object appearance, and semantic relationships within the query, the generated query-driven scene graphs use the relevant objects' coordinates, captions, and visual interactions to describe visual scenes comprehensively, facilitating inference by LLMs. The queries from the datasets are natural and written in an informal style.
  • Figure 3: Pipeline of the proposed SGREC. In Step 1, SGREC begins by extracting nouns, predicting categories, and inferring the subject from the input query and the original image; it then identifies query-related objects by selecting detections with matched labels. In Step 2, a scene graph is generated in three parts: class names and coordinates from the detector to encode spatial information, generated image captions for each object to describe their appearance, and predicted relation triplets to capture interactions between objects. Finally, in Step 3, SGREC analyzes the query and the generated scene graph to infer the index of the target object. (A minimal code sketch of this three-step flow follows the figure list below.)
  • Figure 4: Detailed prompts used in SGREC, covering modules for subject inference, object caption generation, interaction extraction, and final LLM-based inference.
  • Figure 5: Illustration of the subject inference, including its inputs and outputs.
  • ...and 5 more figures
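
As referenced in the Figure 3 caption, the three-step pipeline can be summarized as control flow. The sketch below reuses the SceneGraph and ObjectNode types from the earlier snippet; every callable passed in is a hypothetical stand-in for a parser, detector, VLM, or LLM call, and none of these names come from the paper.

    def sgrec(image, query: str, *,
              extract_nouns, predict_categories, infer_subject,
              detect_objects, caption_region, extract_interactions,
              llm_infer) -> int:
        """Sketch of SGREC's three-step pipeline as described in Figure 3.
        All callable arguments are hypothetical stubs supplied by the caller."""
        # Step 1: parse the query and keep only query-related detections.
        nouns = extract_nouns(query)               # nouns mentioned in the query
        categories = predict_categories(nouns)     # map nouns to detector labels
        subject = infer_subject(query, image)      # the entity being referred to
        related = [d for d in detect_objects(image) if d.label in categories]

        # Step 2: build the query-driven scene graph (types from the sketch above).
        graph = SceneGraph()
        for i, det in enumerate(related):
            caption = caption_region(image, det.bbox)   # per-object VLM caption
            graph.nodes.append(ObjectNode(i, det.label, det.bbox, caption))
        graph.triplets = extract_interactions(image, graph.nodes)

        # Step 3: the LLM reads the serialized graph, explains its reasoning,
        # and returns the index of the target object.
        return llm_infer(query=query, subject=subject, scene=graph.to_prompt())

Passing the model calls in as arguments keeps the sketch valid on its own while making clear that each stage (noun extraction, detection, captioning, relation prediction, final inference) is an exchangeable component.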