Multi-Modal Grounded Planning and Efficient Replanning For Learning Embodied Agents with A Few Examples

Taewoong Kim, Byeonghwi Kim, Jonghyun Choi

TL;DR

FLARE addresses data-efficient embodied planning by grounding LLM-based planning in environmental perception through two components: a Multi-Modal Planner (MMP) and Environment Adaptive Replanning (EAR). MMP retrieves top-k multimodal demonstrations and prompts an LLM to generate subgoal sequences, while EAR substitutes undetected objects with semantically similar observed ones to ground the plan without repeated LLM calls. The approach achieves state-of-the-art performance on ALFRED in few-shot settings, with GPT-4 variants delivering up to $+24.46$ percentage points on unseen tasks, and ablations confirm the complementary benefits of MMP and EAR. Qualitative and robotic-task evaluations illustrate improved grounding, robustness to language variation, and practical applicability with limited data. Overall, FLARE reduces annotation costs and demonstrates effective, grounded planning for embodied agents in realistic environments.
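The retrieval step of MMP can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's multi-modal similarity equation is not reproduced here, so the weighted sum of text and visual cosine similarities, the weight `w_text`, the toy 2-D embeddings, and the names `retrieve_top_k`, `text_emb`, and `view_emb` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_text, query_view, examples, k=2, w_text=0.5):
    """Return the k examples with the highest combined similarity.

    ASSUMPTION: the combined score is a weighted sum of the text and
    visual cosine similarities; the paper's exact formula may differ.
    """
    def score(example):
        text_sim = cosine(query_text, example["text_emb"])
        view_sim = cosine(query_view, example["view_emb"])
        return w_text * text_sim + (1.0 - w_text) * view_sim

    return sorted(examples, key=score, reverse=True)[:k]


# Hypothetical training pairs with toy embeddings (real ones would come
# from text and image encoders).
examples = [
    {"name": "clean_soap", "text_emb": [1.0, 0.0], "view_emb": [1.0, 0.0]},
    {"name": "heat_mug", "text_emb": [0.0, 1.0], "view_emb": [0.0, 1.0]},
    {"name": "clean_cloth", "text_emb": [0.9, 0.1], "view_emb": [0.8, 0.2]},
]

top = retrieve_top_k([1.0, 0.0], [1.0, 0.0], examples, k=2)
names = [example["name"] for example in top]
```

In this toy setup the two cleaning-related examples rank above the unrelated one; the retrieved demonstrations would then be placed in the LLM prompt as in-context examples.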

Abstract

Learning a perception and reasoning module for robotic assistants to plan steps to perform complex tasks based on natural language instructions often requires large free-form language annotations, especially for short high-level instructions. To reduce the cost of annotation, large language models (LLMs) are used as planners with little data. However, when elaborating the steps, even state-of-the-art LLM-based planners rely mostly on linguistic common sense, often neglecting the status of the environment at command reception, resulting in inappropriate plans. To generate plans grounded in the environment, we propose FLARE (Few-shot Language with environmental Adaptive Replanning Embodied agent), which improves task planning using both the language command and environmental perception. As language instructions often contain ambiguities or incorrect expressions, we additionally propose to correct the mistakes using visual cues from the agent. Thanks to the visual cues, the proposed scheme needs only a few language pairs and outperforms state-of-the-art approaches. Our code is available at https://github.com/snumprlab/flare.

Paper Structure

This paper contains 31 sections, 3 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of the proposed FLARE. Our agent consists of (1) 'Multi-Modal Planner (MMP)' and (2) 'Environment Adaptive Replanning (EAR)'. MMP takes into account both the agent's initial surrounding views and received instructions to generate a sequence of subgoals by prompting an LLM (e.g., GPT-4). When the agent gets stuck while executing a plan, EAR adjusts the ungrounded plan to a physically grounded one with visual cues.
  • Figure 2: Detailed architecture of FLARE. It comprises 'Multi-Modal Planner (MMP)' and 'Environment Adaptive Replanning (EAR)'. MMP retrieves the top $k$ relevant training data pairs of instruction and expert demonstration (indicated with Expert Demon.), based on the agent's initial panoramic surrounding views and language instructions, then plans a sequence of actions through LLMs (e.g., GPT-4) with these examples. When the agent fails to locate the target object (e.g., 'TrashCan'), it requests replanning via EAR. Using visual observations and semantic similarity, EAR identifies the most similar object available within the scene and replaces the missing one (e.g., 'GarbageCan').
  • Figure 3: Multi-Modal Planner. MMP selects top $k$ expert demonstrations based on 'multi-modal similarity' (defined by the paper's multi-modal similarity equation) and then converts them into subgoal triplets $(A_n, O_n, R_n)$. MMP uses subgoal triplets, along with a text prompt, to guide an LLM in generating task-specific subgoal sequences from natural language instructions.
  • Figure 4: Environment Adaptive Replanning. EAR corrects a plan by listing detected objects and calculating semantic similarities to replace inaccurately referenced items (e.g., TrashCan). This ensures that the plan is grounded in the environment.
  • Figure 5: Benefits of proposed multi-modal planner (MMP). An agent without MMP misinterprets the task, simply placing a SoapBar in the SinkBasin. In contrast, an agent with MMP seems to comprehend an objective of cleaning, generating a plausible plan and subsequently completing the task successfully.
  • ...and 6 more figures
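
The replacement step described for EAR in Figures 2 and 4 can be sketched as follows. This is a hedged illustration rather than the released code: the toy embedding table `TOY_EMB`, the function name `replace_missing`, the action names, and the reading of a subgoal triplet as (action, object, receptacle) are assumptions; a real system would obtain embeddings from a language model and the detected-object list from the agent's perception module.

```python
import math

# ASSUMPTION: hand-made 2-D vectors standing in for semantic embeddings
# of object names.
TOY_EMB = {
    "TrashCan": [0.90, 0.10],
    "GarbageCan": [0.85, 0.15],
    "Mug": [0.10, 0.90],
    "SinkBasin": [0.30, 0.70],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def replace_missing(plan, missing, detected, emb=TOY_EMB):
    """Swap `missing` for the most similar detected object in every subgoal.

    `plan` is a list of (action, object, receptacle) triplets. No LLM call
    is made; only the object names in the existing plan are rewritten.
    """
    substitute = max(detected, key=lambda name: cosine(emb[missing], emb[name]))
    return [
        (
            action,
            substitute if obj == missing else obj,
            substitute if receptacle == missing else receptacle,
        )
        for action, obj, receptacle in plan
    ]


# Illustrative plan that refers to an object the agent cannot find.
plan = [("PickupObject", "Mug", None), ("PutObject", "Mug", "TrashCan")]
detected = ["GarbageCan", "Mug", "SinkBasin"]
grounded = replace_missing(plan, "TrashCan", detected)
```

Here 'TrashCan' is rewritten to 'GarbageCan' because it is the nearest detected name in the toy embedding space, which mirrors the example in the captions while leaving the rest of the plan untouched.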