Answerability Fields: Answerable Location Estimation via Diffusion Models
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Motoaki Kawanabe
TL;DR
The paper tackles agent-perspective question answering in 3D indoor environments by predicting the locations from which an agent can answer a given question. It introduces Answerability Fields (AnsFields), a grid-based map of location-wise answerability scores that are computed with a VQA model and predicted for unseen scenes by a diffusion model conditioned on a top-down map and the question. The approach is built on ScanNet/ScanQA data, with OFA refining the answerability scores and InstructPix2Pix predicting AnsFields, and it improves over baselines on agent-perspective QA. Empirically, AnsFields make QA more accurate and efficient by guiding viewpoint selection, which reduces the need to explore irrelevant regions. This work advances embodied AI by coupling map-conditioned reasoning with diffusion-based prediction to improve interactions between agents and complex indoor environments.
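As a rough illustration of what a grid-based map of location-wise answerability scores might look like, the sketch below evaluates a scoring function at the centre of every cell of a 2D floor grid and then picks the highest-scoring cell as the viewpoint. This is not the authors' implementation: the function names, the grid parameters, and `toy_score` are hypothetical, and `toy_score` merely stands in for a VQA model's confidence at a given position.

```python
import numpy as np


def build_answerability_field(score_fn, question, x_range, y_range, cell_size):
    """Evaluate score_fn at the centre of every grid cell on the floor plan.

    score_fn is a hypothetical callable (position, question) -> score in [0, 1];
    in a real pipeline it would wrap a VQA model queried from that position.
    """
    xs = np.arange(x_range[0] + cell_size / 2, x_range[1], cell_size)
    ys = np.arange(y_range[0] + cell_size / 2, y_range[1], cell_size)
    field = np.zeros((len(ys), len(xs)), dtype=np.float32)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            field[i, j] = score_fn((x, y), question)
    return field, xs, ys


def best_viewpoint(field, xs, ys):
    """Return the (x, y) centre of the cell with the highest score."""
    i, j = np.unravel_index(np.argmax(field), field.shape)
    return float(xs[j]), float(ys[i])


def toy_score(position, question):
    """Placeholder scorer (assumption): the score decays with distance
    from a fixed target location and ignores the question text."""
    target = np.array([2.25, 0.75])
    dist = np.linalg.norm(np.asarray(position) - target)
    return float(np.exp(-dist ** 2))


field, xs, ys = build_answerability_field(
    toy_score, "What is on the table?", (0.0, 4.0), (0.0, 2.0), 0.5
)
view = best_viewpoint(field, xs, ys)
```

In this toy setup the field has one score per 0.5 m cell, and the selected viewpoint is the cell centre closest to the placeholder target. Predicting such a field for an unseen scene, as the summary describes, would replace the exhaustive per-cell scoring with a single model prediction conditioned on the top-down map and the question.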
Abstract
In an era characterized by advancements in artificial intelligence and robotics, enabling machines to interact with and understand their environment is a critical research endeavor. In this paper, we propose Answerability Fields, a novel approach to predicting answerability within complex indoor environments. Leveraging a 3D question answering dataset, we construct a comprehensive Answerability Fields dataset, encompassing diverse scenes and questions from ScanNet. Using a diffusion model, we successfully infer and evaluate these Answerability Fields, demonstrating the importance of objects and their locations in answering questions within a scene. Our results showcase the efficacy of Answerability Fields in guiding scene-understanding tasks, laying the foundation for their application in enhancing interactions between intelligent agents and their environments.
