Visual Environment-Interactive Planning for Embodied Complex-Question Answering
Ning Lan, Baoshan Ou, Xuemei Xie, Guangming Shi
TL;DR
The paper addresses Embodied Complex-Question Answering, formalized as $G=f(Q,E)$, where a robot must plan actions in an indoor environment to answer complex questions. It introduces a sequential planning framework built on a four-layer indoor visual scene graph and a structured semantic space, employing an Observation-Planning-Action cycle with Language Parsing, Rule-based Plan, and LLM-based Plan to ground reasoning in perception. Key contributions include the Structured Semantic Space, the ECQA dataset with template-based and multi-step questions, and empirical evidence showing improved performance and robustness in both simulated HM3D environments and real-world demonstrations, especially for small objects and people. The work demonstrates practical impact by reducing reliance on large LLMs, enabling interpretable, environment-grounded planning for embodied QA, and pointing toward automatic scene-graph construction and dynamic environment handling as future directions.
Abstract
This study focuses on Embodied Complex-Question Answering task, which means the embodied robot need to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approach rely on large models, without sufficient understanding of the environment. Considering multi-step planning, the framework for formulating plans in a sequential manner is proposed in this paper. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. And also, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.
