Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents

Byeonghwi Kim, Jinyeon Kim, Yuyeong Kim, Cheolhong Min, Jonghyun Choi

TL;DR

This work shows empirically that an agent equipped with the proposed CAPEAM achieves state-of-the-art performance by large margins across various metrics on a challenging interactive instruction-following benchmark, in both seen and unseen environments.

Abstract

Accomplishing household tasks requires planning step-by-step actions while considering the consequences of previous actions. However, state-of-the-art embodied agents often make mistakes in navigating the environment and interacting with the proper objects, because they learn imperfectly by imitating experts or algorithmic planners that lack such knowledge. To improve both visual navigation and object interaction, we propose CAPEAM (Context-Aware Planning and Environment-Aware Memory), which accounts for the consequences of taken actions by incorporating semantic context (e.g., appropriate objects to interact with) into a sequence of actions, and by using the changed spatial arrangement and states of interacted objects (e.g., the location an object has been moved to) when inferring subsequent actions. We empirically show that the agent with the proposed CAPEAM achieves state-of-the-art performance in various metrics on a challenging interactive instruction-following benchmark, in both seen and unseen environments, by large margins (up to +10.70% in unseen environments).

Paper Structure (37 sections, 4 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Overview of the proposed 'Context-Aware Planning (CAP)' and 'Environment-Aware Memory (EAM)'. The CAP incorporates the 'context' (i.e., task-relevant objects) of the task in generating a sequence of sub-goals (denoted by ✓), compared with the output without the CAP (denoted by ✗). The detailed planners then predict a sequence of agent-executable actions for each respective sub-goal. The agent keeps the state changes of objects and their masks in the EAM and utilizes them when necessary. Even when the agent cannot predict the mask of the plate due to occlusion, it can still interact with the plate thanks to the mask remembered in the EAM, leading to successful task completion.
  • Figure 2: Model Architecture. Our agent consists of (1) 'context-aware planning (CAP)' and (2) 'environment-aware memory (EAM)'. Taking the natural language instructions, the sub-goal planner in the CAP predicts 'context' (i.e., task-relevant objects) and generates a sequence of 'sub-goal frames' that are sub-goals with a predicted action and placeholders for which object should be used with it. Then the objects in the 'sub-goal frames' are completed with predicted objects (the context). For each planned sub-goal, a corresponding detailed planner generates a sequence of 'executable actions.' In the EAM, the agent maintains the semantic spatial map by integrating the predicted depths and masks into 3D world-coordinates along with the state changes of objects with their masks to utilize them during task completion.
  • Figure 3: Context-Aware Planning (CAP). It consists of a 'sub-goal planner' and a set of 'detailed planners' for each sub-goal to generate 'executable actions.' The sub-goal planner first predicts a set of objects related to the task, which we call 'Context.' Then, the 'sub-goal frame sequence generator' in the sub-goal planner generates a sequence of 'sub-goal frames.' Finally, the 'meta-classes' in each sub-goal frame are replaced with the corresponding objects in the context, resulting in the final sub-goal. A 'detailed planner' translates the sub-goal to executable actions.
  • Figure 4: Environment-Aware Memory (EAM). The agent updates the semantic spatial map using predicted depths and object masks for scene information. 'Retrospective Object Recognition' preserves the latest object mask to approximate the current object's mask when mask prediction fails. 'Object Relocation Tracking' stores the most recent location of each relocated object and excludes it from future navigation targets. 'Object Location Caching' remembers the locations and masks of objects whose states change.
  • Figure 5: Benefit of Context-Aware Planning (CAP). In the two qualitative examples, the 'contexts' are denoted by $c_O$ in yellow, $c_M$ in blue, and $c_R$ in green colored boxes. While our CAPEAM plans a sub-goal sequence with task-relevant objects, 'CAPEAM w/o CAP' interacts with task-irrelevant objects (i.e., Potato or Knife) and consequently fails.
  • ...and 4 more figures
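
The captions for Figures 3 and 4 describe two mechanisms that a small sketch may make more concrete: filling the meta-class placeholders of 'sub-goal frames' with objects from the predicted context, and falling back to the most recently stored mask when mask prediction fails. The Python below is only an illustration under stated assumptions, not the authors' implementation. The slot names "O" and "R" loosely mirror $c_O$ and $c_R$ from Figure 5, whose exact meanings are not given in this excerpt; the action strings, object names, and the `SubGoalFrame`, `fill_frames`, and `MaskMemory` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumption: the context maps a meta-class slot name to a concrete object name.
Context = Dict[str, str]


@dataclass(frozen=True)
class SubGoalFrame:
    """A sub-goal with a predicted action and a placeholder for its object."""
    action: str  # hypothetical action label, e.g. "PickupObject"
    slot: str    # meta-class placeholder, e.g. "O"


def fill_frames(frames: List[SubGoalFrame], context: Context) -> List[Tuple[str, str]]:
    """Replace each frame's meta-class placeholder with the context object."""
    return [(frame.action, context[frame.slot]) for frame in frames]


class MaskMemory:
    """Toy stand-in for 'Retrospective Object Recognition'.

    Keeps the latest successfully predicted mask per object and returns it
    when the current prediction is missing (represented here by None).
    """

    def __init__(self) -> None:
        self._latest: Dict[str, object] = {}

    def resolve(self, obj: str, predicted_mask: Optional[object]) -> Optional[object]:
        if predicted_mask is not None:
            self._latest[obj] = predicted_mask  # remember the newest mask
            return predicted_mask
        return self._latest.get(obj)  # fall back to the remembered mask, if any


# Usage: complete two frames with a predicted context, then simulate an
# occluded plate whose mask prediction fails on the second step.
frames = [SubGoalFrame("PickupObject", "O"), SubGoalFrame("PutObject", "R")]
context = {"O": "Apple", "R": "Plate"}
plan = fill_frames(frames, context)

memory = MaskMemory()
first = memory.resolve("Plate", "plate_mask_t0")  # prediction succeeded
second = memory.resolve("Plate", None)            # prediction failed
```

Here every sub-goal draws its object from the same predicted context, which is one way to read the claim that the CAP keeps the plan on task-relevant objects. Masks are plain strings purely for illustration; a real agent would store image-sized arrays.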