Visual Environment-Interactive Planning for Embodied Complex-Question Answering

Ning Lan, Baoshan Ou, Xuemei Xie, Guangming Shi

TL;DR

The paper addresses Embodied Complex-Question Answering, formalized as $G=f(Q,E)$, where a robot must plan actions in an indoor environment to answer complex questions. It introduces a sequential planning framework built on a four-layer indoor visual scene graph and a structured semantic space. The framework runs an Observation-Planning-Action cycle that draws on Language Parsing, Rule-based Plan, and LLM-based Plan tools to ground reasoning in perception. Key contributions include the Structured Semantic Space, the ECQA dataset with template-based and multi-step questions, and empirical evidence of improved performance and robustness in both simulated HM3D environments and real-world demonstrations, especially for small objects and people. The work reduces reliance on large language models, enables interpretable, environment-grounded planning for embodied QA, and points toward automatic scene-graph construction and dynamic environment handling as future directions.

Abstract

This study focuses on the Embodied Complex-Question Answering task, in which an embodied robot needs to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approaches rely on large models without sufficient understanding of the environment. Considering multi-step planning, this paper proposes a framework that formulates plans in a sequential manner. To ensure that our framework can tackle complex questions, we create a structured semantic space, where hierarchical visual perception and a chain expression of the question essence interact iteratively. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which clarifies the intention of the question. Then, we incorporate external rules to make a plan for the current step, reducing the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach achieves strong and stable performance on complex tasks. We also establish the feasibility of our approach in real-world scenarios, indicating its practical applicability.
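
The sequential planning idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `sequential_plan`, the `Step` record, and every callable passed in are hypothetical names introduced here, and the LLM-based Plan tool is replaced by a plain stub. The sketch only shows the control flow the abstract describes: parse the question once, then at each step try a rule-based plan first, fall back to an LLM-based plan when no rule applies, act, observe, and stop once an answer can be read from the feedback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    tool: str         # which tool produced the plan: "rule" or "llm"
    plan: str         # the plan chosen for this step
    observation: str  # feedback observed after acting on the plan


def sequential_plan(
    question: str,
    parse: Callable[[str], str],
    rule_plan: Callable[[str, List[Step]], Optional[str]],
    llm_plan: Callable[[str, List[Step]], str],
    act_and_observe: Callable[[str], str],
    answer_from: Callable[[str, List[Step]], Optional[str]],
    max_steps: int = 10,
) -> Tuple[Optional[str], List[Step]]:
    """Plan one step at a time, using feedback from every earlier step."""
    pattern = parse(question)  # language parsing happens once, up front
    history: List[Step] = []
    for _ in range(max_steps):
        # Prefer a rule-based plan; fall back to the LLM stub if no rule fires.
        plan, tool = rule_plan(pattern, history), "rule"
        if plan is None:
            plan, tool = llm_plan(pattern, history), "llm"
        observation = act_and_observe(plan)
        history.append(Step(tool, plan, observation))
        answer = answer_from(pattern, history)
        if answer is not None:
            return answer, history
    return None, history  # step budget exhausted without an answer


# Toy stand-ins (all hypothetical) so the loop can be run end to end.
ROUTE = ["go to floor 1", "go to bedroom", "go to desk"]
WORLD = {
    "go to floor 1": "on floor 1",
    "go to bedroom": "in bedroom",
    "go to desk": "at desk, book visible",
    "look at book": "book is red",
}


def toy_rule_plan(pattern: str, history: List[Step]) -> Optional[str]:
    # Rules cover navigation only; afterwards they return None.
    return ROUTE[len(history)] if len(history) < len(ROUTE) else None


def toy_answer(pattern: str, history: List[Step]) -> Optional[str]:
    last = history[-1].observation
    return last[len("book is "):] if last.startswith("book is ") else None


answer, history = sequential_plan(
    "What color is the book in the bedroom?",
    parse=lambda q: "attribute(color) of book",
    rule_plan=toy_rule_plan,
    llm_plan=lambda pattern, history: "look at book",  # stub, no real LLM call
    act_and_observe=lambda plan: WORLD[plan],
    answer_from=toy_answer,
)
```

In this toy run the first three steps come from the rule stub and only the last one uses the LLM stub, which mirrors the stated goal of weakening the reliance on large models.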

Paper Structure

This paper contains 24 sections, 2 equations, 8 figures, 3 tables, 2 algorithms.

Figures (8)

  • Figure 1: An illustration of our framework for task planning. The embodied agent is the core of the system. Inside the "Embodied Agent" block, we construct a structured semantic space, enabling continuous interaction between natural language instructions and visual perception. After receiving the natural language question, the embodied agent first selects the Language Parsing tool to obtain the pattern of the question ($t_0$). The intent is to obtain the attribute ($A$) of a book in the small object layer ($V_4$). Combining the pattern with rule-based analysis shows that the optimal observation point for answering the question is at the book in the small object layer. Therefore, the embodied agent next selects the Rule-based Plan and Observation tools to formulate the current navigation plan by interacting with the environment ($t_1$, $t_2$, $t_3$). When the embodied agent determines that it cannot give a plan based on rules, it chooses the LLM-based Plan tool, using the language ability of LLMs to give plans based on feedback from environmental interactions ($t_4$).
  • Figure 2: A taxonomy of task planning methods in the era of foundation models with the related work from this section referenced.
  • Figure 3: A comparison between one-step and sequential (ours) task planning at time step $i$.
  • Figure 4: An example of the hierarchical structure of an indoor scene graph. The scene graph consists of four levels: floor, room, big object, and small object.
  • Figure 5: Example experimental results of our framework in simulated environments. We show the results for different types of questions and the steps at which our method stops.
  • ...and 3 more figures
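
The four-level hierarchy described in the Figure 4 caption (floor, room, big object, small object) can be sketched as a small tree structure. This is a minimal illustration under stated assumptions, not the paper's data structure: the names `Node`, `LAYERS`, and `path_to` are hypothetical, and the rule that each child must sit exactly one layer below its parent is an assumption made here for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Layer names follow the Figure 4 caption; the strict ordering is an assumption.
LAYERS = ("floor", "room", "big_object", "small_object")


@dataclass
class Node:
    name: str
    layer: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child that belongs to the next layer down."""
        expected = LAYERS.index(self.layer) + 1
        if expected >= len(LAYERS) or LAYERS[expected] != child.layer:
            raise ValueError(f"{child.layer} cannot be placed under {self.layer}")
        self.children.append(child)
        return child


def path_to(root: Node, name: str) -> Optional[List[str]]:
    """Depth-first search returning the chain of node names from root to target."""
    if root.name == name:
        return [root.name]
    for child in root.children:
        sub = path_to(child, name)
        if sub is not None:
            return [root.name] + sub
    return None


# A toy scene: one floor, one room, one big object, one small object.
floor = Node("floor_1", "floor")
bedroom = floor.add(Node("bedroom", "room"))
desk = bedroom.add(Node("desk", "big_object"))
book = desk.add(Node("book", "small_object", {"color": "red"}))

path = path_to(floor, "book")
```

A path such as the one computed here is the kind of layer-by-layer route a rule-based planner could follow toward an observation point for a small object.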