Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs

Jialou Wang; Manli Zhu; Yulei Li; Honglei Li; Longzhi Yang; Wai Lok Woo

TL;DR

Detect2Interact addresses fine-grained localization of object key fields in VQA by combining SAM-based segmentation, Vision Studio semantics, and GPT-4 reasoning to map user queries to specific object parts. The approach uses a three-module pipeline: zero-shot semantic object detection, target object retrieval, and visual key field detection, which together enable precise interaction with object components. Qualitative results show more accurate key-field localization and more robust zero-shot object detection than MiniGPT-v2, with practical implications for robotics and augmented reality. The work advances interactive multimodal AI by bridging detailed visual segmentation with common-sense reasoning to support actionable VQA responses.

Abstract

Localization plays a crucial role in enhancing the practicality and precision of VQA systems. By enabling fine-grained identification of, and interaction with, specific parts of an object, it significantly improves a system's ability to provide contextually relevant and spatially accurate responses, which is crucial for applications in dynamic environments such as robotics and augmented reality. However, traditional systems struggle to map objects within images accurately enough to generate nuanced, spatially aware responses. In this work, we introduce "Detect2Interact", which addresses these challenges with an approach for fine-grained detection of object visual key fields. First, we use the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images. Second, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4's common-sense knowledge to bridge the gap between an object's semantics and its spatial map. As a result, Detect2Interact achieves consistent qualitative results on object key field detection across extensive test cases and outperforms the existing VQA system with object detection by providing a more reasonable and finer visual representation.
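The first step of the pipeline, pairing class-agnostic segmentation masks with semantic labels, can be illustrated with a small sketch. The abstract does not say how the spatial maps and the semantic descriptions are combined, so the matching rule below (assign each mask the label of the best-overlapping labeled box by IoU) is an assumption made for illustration only. The names `LabeledBox`, `mask_to_box`, `iou`, and `label_masks`, and the 0.5 threshold, are hypothetical and do not come from the paper, SAM, or Vision Studio.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical box convention: (x_min, y_min, x_max, y_max), max edges exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class LabeledBox:
    """A semantic detection: a text label plus its bounding box (assumed format)."""
    label: str
    box: Box


def mask_to_box(mask: List[List[int]]) -> Box:
    """Return the tight bounding box around the nonzero cells of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def label_masks(
    masks: List[List[List[int]]],
    detections: List[LabeledBox],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Optional[str]]:
    """For each mask, return the label of the best-overlapping box, or None."""
    labels: List[Optional[str]] = []
    for mask in masks:
        box = mask_to_box(mask)
        best = max(detections, key=lambda d: iou(box, d.box), default=None)
        if best is not None and iou(box, best.box) >= threshold:
            labels.append(best.label)
        else:
            labels.append(None)
    return labels


# Toy example: one 4x4 mask whose object occupies the central 2x2 block.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
toy_detections = [LabeledBox("mug", (1, 1, 3, 3)), LabeledBox("table", (0, 0, 4, 4))]
toy_labels = label_masks([toy_mask], toy_detections)
```

In this toy case the mask's box coincides with the "mug" box, so the mask would receive that label, while the much larger "table" box falls below the assumed threshold. A real system would take masks from a segmentation model and labeled boxes from a vision service rather than hand-written lists.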

Paper Structure (11 sections, 11 figures)

Related Works
Overview of Detect2Interact
Zero-shot Semantic Object Detection
Target Object Retrieval
Visual Key Field Detection
Prompt Details
Qualitative Evaluation
System Settings
Test Cases
Limitation
Conclusion

Figures (11)

Figure 1: An overview of our Detect2Interact framework. (a) Given an image, we first adopt SAM to segment everything within this image, generating spatial maps of all objects. We then use Vision Studio proposed by Microsoft to obtain objects' semantics. Finally, zero-shot object detection is achieved by combining the spatial maps and object semantics. (b) Given a user query, we utilize the common sense knowledge of ChatGPT to extract the target object (e.g., the "mug") and to interpret the user action (e.g., "grab"). By feeding the extracted information back to ChatGPT, the key-field semantic of the target object (e.g., the "handle") is determined. (c) We then feed the spatial matrix of the target object into ChatGPT to recognize its specific visual key field that fits the user action.
Figure 2: Illustration of object-level composition, in which $\otimes$ represents composition operation.
Figure 3: Illustration of object decomposition and composition.
Figure 4: Visual input (left) and output (right).
Figure 5: Comparison with MiniGPT-v2 on zero-shot object detection.
...and 6 more figures
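
Step (c) of Figure 1 passes the target object's spatial matrix to the language model as text. The sketch below shows one way such a matrix could be prepared: a binary mask is downsampled to a coarse grid and embedded in a prompt. The grid size, the "any pixel set" pooling rule, the function names `downsample_mask` and `build_key_field_prompt`, and the prompt wording are all assumptions for illustration; the paper's actual prompts are described in its Prompt Details section and are not reproduced here.

```python
from typing import List


def downsample_mask(mask: List[List[int]], grid: int = 8) -> List[List[int]]:
    """Reduce a binary mask to at most grid x grid cells.

    A cell is 1 if any pixel in the corresponding block is set (assumed rule).
    """
    height, width = len(mask), len(mask[0])
    rows, cols = min(grid, height), min(grid, width)
    coarse = []
    for i in range(rows):
        r0, r1 = i * height // rows, (i + 1) * height // rows
        line = []
        for j in range(cols):
            c0, c1 = j * width // cols, (j + 1) * width // cols
            occupied = any(
                mask[r][c] for r in range(r0, r1) for c in range(c0, c1)
            )
            line.append(1 if occupied else 0)
        coarse.append(line)
    return coarse


def build_key_field_prompt(
    label: str, key_field: str, action: str, mask: List[List[int]], grid: int = 8
) -> str:
    """Compose an illustrative prompt that embeds the coarse spatial matrix as text."""
    matrix = downsample_mask(mask, grid)
    matrix_text = "\n".join(" ".join(str(v) for v in row) for row in matrix)
    return (
        f"Object: {label}\n"
        f"User action: {action}\n"
        f"Key field to locate: {key_field}\n"
        f"Spatial matrix (1 = object, 0 = background):\n{matrix_text}\n"
        "Which cells most likely belong to the key field?"
    )


# Toy example: a 4x4 mask with a single object pixel in the top-right corner.
toy_mask = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
toy_coarse = downsample_mask(toy_mask, grid=2)
toy_prompt = build_key_field_prompt("mug", "handle", "grab", toy_mask, grid=2)
```

Downsampling keeps the prompt short at the cost of spatial detail, so the grid size is a trade-off that would need tuning for a given model and image resolution.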
