Multi-Level Compositional Reasoning for Interactive Instruction Following

Suvaansh Bhambri, Byeonghwi Kim, Jonghyun Choi

TL;DR

This work tackles long-horizon interactive instruction following by introducing MCR-Agent, a three-level hierarchical architecture that decomposes tasks into subgoals, navigation, and object interaction. A Policy Composition Controller selects subgoals from language, a Master Policy handles navigation and triggers Interaction Policies, and a suite of Interaction Policies executes precise manipulations, aided by an Object Encoding Module (OEM) and a Loop Escape mechanism that avoids deadlocks. On ALFRED, MCR-Agent achieves a $2.03\%$ absolute improvement in PLWSR, a path-length-weighted success rate, on unseen environments without rule-based planning or semantic memory, while offering interpretable subgoals and faster learning through modular specialization. Ablations confirm that OEM, NIH, and MIP each contribute to overall performance and robustness. The approach offers a scalable path toward robust, interpretable embodied AI for long-horizon domestic tasks without heavy external supervision.
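The division of labor described above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the authors' implementation: `compose_subgoals`, `run_agent`, the keyword matching, and the action strings are all hypothetical stand-ins for the learned Policy Composition Controller, Master Policy, and Interaction Policies, and object masks are omitted.

```python
from typing import Callable, Dict, List, Tuple

# (subgoal type, target object); the labels are illustrative only.
Subgoal = Tuple[str, str]
Policy = Callable[[str], List[str]]


def compose_subgoals(instructions: List[str]) -> List[Subgoal]:
    """Toy stand-in for the Policy Composition Controller.

    Uses keyword lookup instead of a learned model (an assumption made
    purely to keep the sketch runnable).
    """
    subgoals: List[Subgoal] = []
    for text in instructions:
        words = text.lower().rstrip(".").split()
        if not words:
            continue  # skip empty instructions
        if "pick" in words:
            kind = "PickupObject"
        elif "put" in words:
            kind = "PutObject"
        else:
            kind = "GotoLocation"
        # Treat the last word as the target object (toy heuristic).
        subgoals.append((kind, words[-1]))
    return subgoals


def run_agent(
    instructions: List[str],
    navigate: Policy,
    interaction_policies: Dict[str, Policy],
) -> List[str]:
    """Dispatch each subgoal to navigation or to its own interaction policy."""
    trace: List[str] = []
    for kind, target in compose_subgoals(instructions):
        if kind == "GotoLocation":
            trace.extend(navigate(target))
        else:
            trace.extend(interaction_policies[kind](target))
    return trace


# Hypothetical stub policies that return fixed action sequences.
demo_trace = run_agent(
    ["Walk to the counter.", "Pick up the mug."],
    navigate=lambda target: ["MoveAhead", "MoveAhead"],
    interaction_policies={
        "PickupObject": lambda target: [f"PickupObject({target})"],
        "PutObject": lambda target: [f"PutObject({target})"],
    },
)
```

Keeping one policy per interaction type is what lets each module specialize; in the sketch this is simply a dictionary lookup keyed by subgoal type.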

Abstract

Robotic agents that perform domestic chores from natural language directives must master the complex job of navigating an environment and interacting with the objects in it. The tasks given to such agents are often composite and therefore challenging, since completing them requires reasoning about multiple subtasks (e.g., bringing a cup of coffee). To address this challenge, we propose a divide-and-conquer approach that breaks the task into multiple subgoals and attends to them individually for better navigation and interaction. We call the resulting agent the Multi-level Compositional Reasoning Agent (MCR-Agent). Specifically, we learn a three-level action policy. At the highest level, a high-level policy composition controller infers a sequence of human-interpretable subgoals to be executed from the language instructions. At the middle level, a master policy discriminatively controls the agent's navigation by alternating between a navigation policy and various independent interaction policies. Finally, at the lowest level, the appropriate interaction policy infers manipulation actions along with the corresponding object masks. Our approach not only generates human-interpretable subgoals but also achieves a 2.03% absolute gain over comparable state-of-the-art methods in the efficiency metric (PLWSR on the unseen set), without using rule-based planning or a semantic spatial memory.
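For readers unfamiliar with the efficiency metric, the per-episode path-length-weighted score used by the ALFRED benchmark can be sketched as follows. The function name and arguments are hypothetical; the formula is the benchmark's definition, where a success indicator (or goal-condition score) is scaled by the ratio of the expert demonstration length to the longer of the expert and agent trajectories. How episodes are aggregated into the reported PLWSR is not covered here.

```python
def path_length_weighted(score: float, expert_len: int, agent_len: int) -> float:
    """Per-episode path-length-weighted score.

    score      -- 1.0 for a successful episode, 0.0 otherwise (or a
                  goal-condition fraction in [0, 1])
    expert_len -- number of actions in the expert demonstration
    agent_len  -- number of actions the agent actually took
    """
    if expert_len <= 0 or agent_len <= 0:
        raise ValueError("path lengths must be positive")
    # Shorter-than-expert paths are not rewarded beyond the raw score.
    return score * expert_len / max(expert_len, agent_len)
```

Under this definition, an agent that succeeds but takes twice as many actions as the expert earns half credit, which is why the metric rewards efficient action sequences and not only task completion.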

Paper Structure (55 sections, 7 equations, 14 figures, 8 tables)


Figures (14)

  • Figure 1: The proposed 'Multi-level compositional reasoning' contrasted with 'Flat policy reasoning'. Flat policy reasoning has been employed in prior work (shridhar2020alfred; singh2021factorizing; pashevich2021episodic; nguyen2021look), which trains an agent to learn the low-level actions directly. In contrast, our multi-level policy decomposes a long-horizon task into multiple subtasks and leverages high-level abstract planning, which enables an agent to better address long-horizon planning.
  • Figure 2: Model Architecture. $I_t^d$ denotes an RGB frame from an explorable direction, $d \in [0, D]$, at time step $t$, where $d = 0$ indicates the egocentric direction. We encode $I_t^d$ using a pretrained ResNet and acquire a visual feature, $v_t^d$. $\hat{x}_i$ denotes each step-by-step instruction. $\hat{l}_{T,v}$ and $\hat{l}_{T,m}$ denote the encoded instructions for the 'interactive perception module' and the 'action prediction module', respectively. $\hat{l}_{T:T+1,n}$ denotes the encoded 'subtask' instruction (Sec. 'Master Policy'). $T$ refers to the index of the current subgoal. In our master policy, OEM outputs an object encoding, $o_t$, using $\hat{l}_{T:T+1,n}$. 'VL-Ground' uses dynamic filters to capture the correspondence between visual and language features and outputs attended visual features, $\hat{v}_{t}^{pan}$ and $\hat{v}_{t}^{ego}$.
  • Figure 3: Multi-level policy learns faster and more effective action sequences. Plot (a) shows the learning curves (success rates vs. epochs) of the hierarchical and flat policy agents for unseen and seen environments. Plot (b) presents the average length of an episode traversed by a hierarchical or flat policy for the seven task types (shridhar2020alfred). The flat policy denotes the NIH ablated agent, #(c) in Table \ref{tab:ablation}.
  • Figure 4: Loop escape module (LEM) for escaping deadlock states. The objective of the agent at the current time step is to move to a target object (a garbage can). Figures (a) and (b) show an example of a deadlock state and the behavior of the loop escape module when finding the target object. Each dark-blue square denotes the position of the agent. $\circ$ denotes the target object that the agent should navigate to. $\xrightarrow{}$ denotes the view direction of the agent. The dashed $\circ$ and $\xrightarrow{}$ indicate that the target object is invisible to the agent due to occlusion. $\xrightarrow{}$ denotes actions taken by the agent in a deadlock state. The loop escape module cancels the current action that causes the deadlock state, denoted by $\times$, and takes a stochastic action, denoted by $\xrightarrow{}$.
  • Figure 5: Learning curves of subgoal policies. The figure provides the learning curves for the subgoal policy training as discussed in Sec. 'Training and Evaluation.'
  • ...and 9 more figures
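
The loop escape behavior described in the Figure 4 caption (cancel the action that causes the deadlock, then take a stochastic action) can be sketched as below. This is a minimal illustration under assumptions: the captions do not say how a deadlock is detected, so the sketch assumes, hypothetically, that the agent is stuck when its last `window` observed states are identical. The function name, the `window` parameter, and the action strings are all invented for the example.

```python
import random
from typing import Hashable, Sequence


def escape_if_stuck(
    history: Sequence[Hashable],
    proposed: str,
    actions: Sequence[str],
    rng: random.Random,
    window: int = 4,
) -> str:
    """Return the action to execute, overriding `proposed` in a deadlock.

    history  -- hashable summaries of recent agent states (e.g. poses)
    proposed -- action chosen by the navigation policy
    actions  -- the full set of available navigation actions
    """
    recent = list(history)[-window:]
    # Assumed deadlock criterion: the state has not changed for `window` steps.
    stuck = len(recent) == window and len(set(recent)) == 1
    if not stuck:
        return proposed
    # Cancel the proposed action and sample uniformly among the others.
    alternatives = [a for a in actions if a != proposed]
    return rng.choice(alternatives) if alternatives else proposed
```

Passing an explicit `random.Random` instance keeps the stochastic choice reproducible when debugging an episode.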