Table of Contents
Fetching ...

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang

TL;DR

This work tackles embodied instruction following with sparse, incoherent human instructions by introducing ThinkBot, a two-module agent that reasons a thought chain to recover missing actions and interacted objects. An LLM-based instruction completer generates coherent subgoals, while a multimodal object localizer uses semantic maps and a learned object-correlation graph to predict interaction locations. Evaluated on the ALFRED benchmark, ThinkBot achieves state-of-the-art performance across seen and unseen splits with better success rates and execution efficiency, especially in challenging tasks requiring open actions. The results highlight the value of integrating thought-chain reasoning with multimodal grounding for robust, long-horizon EIF in interactive environments.

Abstract

Embodied Instruction Following (EIF) requires agents to complete human instruction by interacting objects in complicated surrounding environments. Conventional methods directly consider the sparse human instruction to generate action plans for agents, which usually fail to achieve human goals because of the instruction incoherence in action descriptions. On the contrary, we propose ThinkBot that reasons the thought chain in human instruction to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instruction. Specifically, we first design an instruction completer based on large language models to recover the missing actions with interacted objects between consecutive human instruction, where the perceived surrounding environments and the completed sub-goals are considered for instruction completion. Based on the partially observed scene semantic maps, we present an object localizer to infer the position of interacted objects for agents to achieve complex human goals. Extensive experiments in the simulated environment show that our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

TL;DR

This work tackles embodied instruction following with sparse, incoherent human instructions by introducing ThinkBot, a two-module agent that reasons a thought chain to recover missing actions and interacted objects. An LLM-based instruction completer generates coherent subgoals, while a multimodal object localizer uses semantic maps and a learned object-correlation graph to predict interaction locations. Evaluated on the ALFRED benchmark, ThinkBot achieves state-of-the-art performance across seen and unseen splits with better success rates and execution efficiency, especially in challenging tasks requiring open actions. The results highlight the value of integrating thought-chain reasoning with multimodal grounding for robust, long-horizon EIF in interactive environments.

Abstract

Embodied Instruction Following (EIF) requires agents to complete human instruction by interacting objects in complicated surrounding environments. Conventional methods directly consider the sparse human instruction to generate action plans for agents, which usually fail to achieve human goals because of the instruction incoherence in action descriptions. On the contrary, we propose ThinkBot that reasons the thought chain in human instruction to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instruction. Specifically, we first design an instruction completer based on large language models to recover the missing actions with interacted objects between consecutive human instruction, where the perceived surrounding environments and the completed sub-goals are considered for instruction completion. Based on the partially observed scene semantic maps, we present an object localizer to infer the position of interacted objects for agents to achieve complex human goals. Extensive experiments in the simulated environment show that our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.
Paper Structure (12 sections, 3 equations, 6 figures, 2 tables)

This paper contains 12 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison between conventional EIF methods (Prompter inoue2022prompter) and our ThinkBot. Existing methods directly leverage sparse human instruction to generate action sequence, which usually get stuck due to the incoherence of instruction. Our ThinkBot recovers missing action descriptions by reasoning the thought chain in sparse human instruction, and can successfully complete challenging tasks.
  • Figure 2: The overall pipeline of ThinkBot, which consists of an instruction completer and an object localizer. The instruction completer generates the coherent instruction with interacted objects based on sparse human instruction and the current visual perception results, and the object localizer predicts the position of the interacted object for manipulation and navigation.
  • Figure 3: Input and output of the instruction completer based on LLMs. The input contains system message describing the world properties and agent message demonstrating perceived environment information. The output includes the thought chain in sparse human instruction and missing subgoals with interacted objects.
  • Figure 4: The overall pipeline of the multimodal object localizer, which uses recovered instruction and observed semantic map to predict object positions for interaction. The object correlation graph is also learned to strengthen the map features.
  • Figure 5: Visualization of the agent action sequence acquired by Prompter+ (top) and our ThinkBot (bottom), where our method can recover the missing actions with interacted instances 'Open Fridge' and 'Open Cabinet' to successfully achieve the human goal.
  • ...and 1 more figures