Answerability Fields: Answerable Location Estimation via Diffusion Models

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Motoaki Kawanabe

TL;DR

The paper tackles agent-perspective question answering in 3D indoor environments by predicting where an agent should look to answer a question. It introduces Answerability Fields (AnsFields), a grid-based map of location-wise answerability scores computed with a VQA model and predicted in unseen scenes via a diffusion model conditioned on top-down maps and the question. The approach is grounded in ScanNet/ScanQA data, with OFA computing the answerability scores and InstructPix2Pix predicting AnsFields, achieving improvements over baselines in agent-perspective QA. Empirically, AnsFields enable more accurate and efficient QA by guiding viewpoint selection, reducing the need to explore irrelevant regions. This work advances embodied AI by coupling map-conditioned reasoning with diffusion-based prediction to enhance interactions between agents and complex indoor environments.

Abstract

In an era characterized by advancements in artificial intelligence and robotics, enabling machines to interact with and understand their environment is a critical research endeavor. In this paper, we propose Answerability Fields, a novel approach to predicting answerability within complex indoor environments. Leveraging a 3D question answering dataset, we construct a comprehensive Answerability Fields dataset, encompassing diverse scenes and questions from ScanNet. Using a diffusion model, we successfully infer and evaluate these Answerability Fields, demonstrating the importance of objects and their locations in answering questions within a scene. Our results showcase the efficacy of Answerability Fields in guiding scene-understanding tasks, laying the foundation for their application in enhancing interactions between intelligent agents and their environments.

Paper Structure (19 sections, 1 equation, 4 figures, 2 tables)

This paper contains 19 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: We propose Answerability Fields to help an agent understand scenes efficiently. We compute the answerability score for a question at each location using a strong Visual Question Answering (VQA) model called OFA.
  • Figure 2: The method for generating Answerability Fields, which calculates, for each sequence, the probability of returning an appropriate answer when a question and an image are input to the VQA model.
  • Figure 3: Overview of generating Answerability Fields via InstructPix2Pix. During training, the model learns to predict noise by adding noise to the ground-truth image. Ground-truth images with only Answerability drawn, with TopPoints highlighted, and with BoundingBoxes on the top-down view images of the scene were used for training and compared in evaluation.
  • Figure 4: We used InstructPix2Pix to predict Answerability Fields for a variety of unseen scenes. The figure shows the AnsFields for each question and the panoramic image taken at the location with the highest answerability score.
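
The grid-based idea described in the TL;DR and in Figures 1 and 4 can be illustrated with a minimal sketch: score every cell of a top-down grid, then pick the highest-scoring cell as the viewpoint. This is not the authors' implementation. The function names `build_ansfield` and `best_location` are hypothetical, and `toy_score` is an assumed stand-in for the probability that a VQA model such as OFA would assign to the correct answer from an image captured at that location.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]


def build_ansfield(rows: int, cols: int,
                   score_fn: Callable[[int, int], float]) -> Grid:
    """Fill a rows x cols grid with one answerability score per cell.

    score_fn(row, col) is a hypothetical placeholder for querying a VQA
    model with the question and the view captured at that grid location.
    """
    return [[float(score_fn(r, c)) for c in range(cols)] for r in range(rows)]


def best_location(field: Grid) -> Tuple[int, int]:
    """Return the (row, col) of the cell with the highest score."""
    best = (0, 0)
    for r, row in enumerate(field):
        for c, score in enumerate(row):
            if score > field[best[0]][best[1]]:
                best = (r, c)
    return best


def toy_score(r: int, c: int) -> float:
    # Assumed toy scorer: peaks at cell (2, 3) and decays with
    # Manhattan distance. A real pipeline would call a VQA model here.
    return 1.0 / (1.0 + abs(r - 2) + abs(c - 3))


field = build_ansfield(4, 5, toy_score)
top = best_location(field)
```

Here `top` is the cell an agent would move to before answering. In the paper's setting the field for an unseen scene is predicted by a diffusion model rather than computed exhaustively, which this sketch does not attempt to reproduce.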