Think-Program-reCtify: 3D Situated Reasoning with Large Language Models
Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin Jin
TL;DR
This paper tackles 3D Situated Reasoning (3D-SR) in egocentric environments by introducing LLM-TPC, a training-free framework that combines large language models with grounded 3D perception through a Think-Program-reCtify loop. The method decouples 3D perception (segmentation, classification, attributes, and spatial relations) from reasoning, enabling the LLM to plan steps, ground them as executable Python programs calling perception APIs, and iteratively rectify failures to produce final answers. On the SQA3D benchmark, LLM-TPC achieves state-of-the-art results without training, and its ensemble with end-to-end models further boosts performance, especially when ground-truth 3D information is available. The approach emphasizes interpretability and robustness, demonstrates strong performance on knowledge-dependent questions, and highlights areas for improvement in perception quality and dataset annotation. Overall, the work offers a versatile, training-free framework that leverages LLMs for complex, multi-skill 3D reasoning, with practical implications for embodied agents and vision-language reasoning tasks.
Abstract
This work addresses the 3D situated reasoning task, which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and limited generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a Think-Program-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.
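The control flow of the Think-Program-reCtify loop can be illustrated with a minimal Python sketch. This is not the released LLM-TPC implementation: `filter_objects`, `think_program_rectify`, `scripted_llm`, the scene format, and the `max_rounds` limit are all hypothetical names and choices introduced here for illustration, and a scripted function stands in for a real LLM call.

```python
from typing import Callable, Dict, List, Optional, Tuple

Scene = List[Dict[str, str]]

# Hypothetical stand-in for a 3D perception API (not the paper's interface).
def filter_objects(scene: Scene, category: str) -> Scene:
    """Return the objects in the scene whose category matches."""
    return [obj for obj in scene if obj["category"] == category]

def think_program_rectify(
    question: str,
    scene: Scene,
    llm: Callable[[str, Optional[str]], Tuple[str, str]],
    max_rounds: int = 3,
) -> Optional[object]:
    """Run the loop until a program executes successfully or rounds run out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Think + Program: the LLM returns a plan and code grounding that plan.
        plan, program = llm(question, feedback)
        env = {"scene": scene, "filter_objects": filter_objects}
        try:
            exec(program, env)      # the program must assign `answer`
            return env["answer"]
        except Exception as exc:
            # Rectify: the error message is fed back for the next round.
            feedback = f"{type(exc).__name__}: {exc}"
    return None

# Scripted stand-in for an LLM: the first program fails, the second is fixed.
def scripted_llm(question: str, feedback: Optional[str]) -> Tuple[str, str]:
    if feedback is None:
        # Calls a function that does not exist, so execution raises NameError.
        return "Count the chairs.", "answer = count_objects(scene, 'chair')"
    return ("Filter chairs, then count them.",
            "answer = len(filter_objects(scene, 'chair'))")

scene = [{"category": "chair"}, {"category": "table"}, {"category": "chair"}]
result = think_program_rectify("How many chairs are there?", scene, scripted_llm)
print(result)
```

In this sketch the first generated program fails with a `NameError`, the error text is passed back as feedback, and the second program executes and yields the answer. Running LLM-generated code with `exec` is done here only for brevity; a real system would want some form of sandboxing.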
