Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin Jin

TL;DR

This paper tackles 3D Situated Reasoning (3D-SR) in egocentric environments by introducing LLM-TPC, a training-free framework that combines large language models with grounded 3D perception through a Think-Program-reCtify loop. The method decouples 3D perception (segmentation, classification, attributes, and spatial relations) from reasoning: the LLM plans steps, grounds them as executable Python programs that call perception APIs, and iteratively rectifies failures to produce a final answer. On the SQA3D benchmark, LLM-TPC achieves state-of-the-art results without training, and its ensemble with end-to-end models further boosts performance, especially when ground-truth 3D information is available. The approach emphasizes interpretability and robustness, performs strongly on knowledge-dependent questions, and highlights room for improvement in perception quality and dataset annotation. Overall, the work offers a versatile, training-free framework that leverages LLMs for complex, multi-skill 3D reasoning, with practical implications for embodied agents and vision-language reasoning tasks.

Abstract

This work addresses the 3D situated reasoning task, which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and limited generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a Think-Program-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.
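
The loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy `SCENE`, the `filter_objects` perception API, the `stub_llm` stand-in for a real LLM, and the rule that a generated program must assign to `answer` are all hypothetical names and conventions chosen for illustration. The sketch only shows the control flow: plan, generate code, execute, and feed any execution error back into the next round.

```python
from typing import Callable, List

# Hypothetical toy scene standing in for the output of 3D perception.
SCENE = [
    {"id": 0, "category": "chair", "center": (1.0, 0.0, 0.0)},
    {"id": 1, "category": "table", "center": (2.0, 1.0, 0.0)},
    {"id": 2, "category": "chair", "center": (-1.0, 2.0, 0.0)},
]


def filter_objects(category: str) -> List[dict]:
    """Hypothetical perception API: return scene objects of one category."""
    return [obj for obj in SCENE if obj["category"] == category]


def run_program(code: str) -> str:
    """Execute generated code; by our convention it must assign `answer`."""
    env = {"filter_objects": filter_objects}
    exec(code, env)  # Sketch only: never exec untrusted code unsandboxed.
    return str(env["answer"])


def think_program_rectify(
    question: str,
    llm: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> str:
    """Run Think -> Program, and Rectify with error feedback on failure."""
    feedback = ""
    for _ in range(max_rounds):
        plan = llm("think", question, feedback)  # Think: step-by-step plan
        code = llm("program", plan, feedback)    # Program: plan -> code
        try:
            return run_program(code)
        except Exception as exc:                 # Rectify: report the failure
            feedback = f"{type(exc).__name__}: {exc}"
    return "unknown"


def stub_llm(phase: str, text: str, feedback: str) -> str:
    """Stand-in for an LLM: the first program is buggy, the retry is fixed."""
    if phase == "think":
        return "1. Find all chairs. 2. Count them."
    if not feedback:
        return "answer = count_objects('chair')"  # undefined API: NameError
    return "answer = len(filter_objects('chair'))"


print(think_program_rectify("How many chairs are in the room?", stub_llm))
```

With the stub, the first round raises a `NameError` because `count_objects` does not exist, and the second round succeeds after the error message is passed back as feedback. A real system would replace `stub_llm` with prompted LLM calls and `filter_objects` with actual 3D perception modules.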

Paper Structure (32 sections, 1 equation, 22 figures, 8 tables)

Figures (22)

  • Figure 1: Situated reasoning in 3D scenes. The task aims to answer complex questions given an egocentric situation in a 3D environment. The green arrow indicates the position and orientation described by the situation, and the green box marks the target object.
  • Figure 2: Existing methods for the 3D-SR task. End-to-end methods lack interpretability and cannot accomplish the 3D-SR task in a zero-shot or few-shot way. Language-only methods fail to conduct multi-modal reasoning and deliver reasonable results.
  • Figure 3: Overall Framework of LLM-TPC. LLM-TPC comprises three key components: the 3D Visual Perception Module equips the LLM with 3D context perception abilities, the Prompt Preparation Stage prepares prompts for reasoning, and the Reasoning Stage involves iterative Think-Program-reCtify loops.
  • Figure 4: Step-by-step plans generated in the Think phase.
  • Figure 5: Program and execution results in the Program phase.
  • ...and 17 more figures