ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang
TL;DR
This work tackles embodied instruction following (EIF) under sparse, incoherent human instructions by introducing ThinkBot, a two-module agent that reasons about the thought chain behind an instruction to recover missing actions and interacted objects. An LLM-based instruction completer generates coherent subgoals, while a multimodal object localizer uses semantic maps and a learned object-correlation graph to predict interaction locations. Evaluated on the ALFRED benchmark, ThinkBot achieves state-of-the-art performance on both seen and unseen splits, with higher success rates and better execution efficiency, especially in challenging tasks requiring open actions. The results highlight the value of integrating thought-chain reasoning with multimodal grounding for robust, long-horizon EIF in interactive environments.
Abstract
Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complicated surrounding environments. Conventional methods generate action plans for agents directly from the sparse human instructions, and they usually fail to achieve human goals because the action descriptions in those instructions are incoherent. In contrast, we propose ThinkBot, which reasons about the thought chain in human instructions to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instructions. Specifically, we first design an instruction completer based on large language models to recover the missing actions and their interacted objects between consecutive human instructions, where the perceived surrounding environment and the completed sub-goals are considered for instruction completion. Based on the partially observed scene semantic maps, we then present an object localizer that infers the positions of interacted objects so that agents can achieve complex human goals. Extensive experiments in the simulated environment show that ThinkBot outperforms state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.
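
As a rough illustration of the two-stage flow described above, the following Python sketch wires a stand-in instruction completer to a stand-in object localizer. It is not the authors' implementation: the names `Subgoal`, `toy_completer`, `localize`, and `recover_missing_steps` are hypothetical, the completer is a hand-written rule rather than a large language model, and the localizer is a plain lookup in an observed map rather than a learned predictor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # assumed 2D grid coordinate in a semantic map


@dataclass
class Subgoal:
    action: str  # e.g. "OpenObject" (illustrative action name)
    target: str  # e.g. "Fridge"


# Hypothetical completer interface: (instruction, visible objects,
# completed subgoals) -> recovered missing subgoals.
Completer = Callable[[str, List[str], List[Subgoal]], List[Subgoal]]


def toy_completer(instruction: str, visible: List[str],
                  completed: List[Subgoal]) -> List[Subgoal]:
    """Rule-based placeholder for an LLM-based instruction completer."""
    recovered: List[Subgoal] = []
    if "fridge" in instruction.lower() and "Fridge" in visible:
        already_open = any(
            s.action == "OpenObject" and s.target == "Fridge"
            for s in completed
        )
        if not already_open:
            # The instruction never says to open the fridge; recover it.
            recovered.append(Subgoal("OpenObject", "Fridge"))
    return recovered


def localize(target: str, semantic_map: Dict[Cell, str]) -> Optional[Cell]:
    """Return the first observed cell labeled `target`, or None."""
    for cell, label in semantic_map.items():
        if label == target:
            return cell
    return None


def recover_missing_steps(
    instructions: List[str],
    semantic_map: Dict[Cell, str],
    completer: Completer = toy_completer,
) -> List[Tuple[Subgoal, Optional[Cell]]]:
    """Pair each recovered subgoal with a location from the map."""
    completed: List[Subgoal] = []
    steps: List[Tuple[Subgoal, Optional[Cell]]] = []
    visible = sorted(set(semantic_map.values()))
    for instruction in instructions:
        for subgoal in completer(instruction, visible, completed):
            steps.append((subgoal, localize(subgoal.target, semantic_map)))
            completed.append(subgoal)
    return steps


semantic_map = {(2, 5): "Fridge", (4, 1): "CounterTop"}
steps = recover_missing_steps(["Put the apple in the fridge."], semantic_map)
```

Here the instruction omits the step of opening the fridge, so the sketch recovers an `OpenObject` subgoal and attaches the fridge's map cell to it. Passing the completed subgoals back into the completer mirrors the abstract's point that instruction completion conditions on both the perceived environment and the sub-goals already achieved.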
