EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM

Shuang Ao, Flora D. Salim, Simon Khan

TL;DR

EMAC+ addresses the limits of static, text-centric planning in embodied robotics with a bidirectional LLM–VLM framework: high-level plans $x_{a,1:N}$ proposed by the LLM are refined in real time through VLM-driven visual feedback $s_v$, which lets the LLM internalize environment dynamics. A collaborative training loop uses imitation learning with Direct Preference Optimization (DPO) to align the VLM with an LLM expert, while LoRA-based fine-tuning updates the LLM using environment trajectories collected during closed-loop execution. The approach grounds textual observations via a PDDL-based text world and keeps a memory of trajectories to support replanning, so that planning and control form a tightly integrated loop. Empirical results on ALFWorld and RT-1 show superior task success, robustness to visual noise, and data-efficient learning, and the ablations indicate that LLM replanning and bidirectional feedback are central to this robustness. Overall, EMAC+ advances adaptive, domain-aware embodied planning by learning environment dynamics through interactive experience while preserving the interpretability of language-guided reasoning.
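
For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO objective, $-\log \sigma\big(\beta[(\log \pi(y_w|x) - \log \pi_{\text{ref}}(y_w|x)) - (\log \pi(y_l|x) - \log \pi_{\text{ref}}(y_l|x))]\big)$. It is not taken from the paper: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the exact loss used by EMAC+ may differ.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full action sequence
    under either the trainable policy or the frozen reference model.
    All names here are hypothetical, not from the EMAC+ paper.
    """
    # How much more the policy favors each sequence than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch the "chosen" sequence would be the expert's action and the "rejected" one the agent's own alternative; the loss falls as the policy prefers the chosen sequence more strongly than the reference model does.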

Abstract

Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it harder for them to make better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.

Paper Structure

This paper contains 21 sections, 3 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: EMAC+: Embodied Multi-modal Agent for Collaborative Planning with LLM + VLM. EMAC+ takes a task instruction and pixel observations as input and plans the sequence of actions needed to complete the task. EMAC+ can also be deployed directly in tasks/environments that require low-level control actions for interaction. EMAC+ uses both the prior knowledge of the LLM expert and the domain-specific knowledge from the environment dynamics.
  • Figure 2: Results on RT-1 planning tasks in the simulation. EMAC+ (ours) denotes the results of our model, and EMAC+ (abla-1, abla-2) denotes the ablation studies.
  • Figure 3: Results on RT-1 Mobile Manipulation tasks in the simulation. EMAC+ (ours) denotes the results of our model, and EMAC+ (abla-1, abla-2) denotes the ablation studies.
  • Figure 4: Ablation studies in ALFWorld. Left: "Comparison of robustness in noise perturbation". To compare the robustness of EMAC+ with the SOTA LLM agent (Reflexion) and VLM agent (EMMA), we crop a random portion of the pixel observation at a specific noise rate. For the LLM agent, which can only interact with the textual world, we randomly replace some tokens in the textual observation with arbitrary ones. Mid: "w/o LLM re-planning" denotes the ablation that removes the LLM fine-tuning step given by Eq. \ref{equ:finetune_loss}. Right: "w/ CE Loss" denotes replacing the DPO loss (Eq. \ref{equ:loss}) with a token-level cross-entropy loss.
  • Figure 5: Task prompts for planning tasks. (Figure \ref{tab:RT1-planning})
  • ...and 2 more figures
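
The text-side perturbation described in the Figure 4 caption (randomly replacing some tokens of the textual observation with arbitrary ones) could be sketched roughly as below. This is an illustration under assumptions, not the authors' code: the function name `perturb_tokens`, the per-token interpretation of the noise rate, and the toy vocabulary are all hypothetical.

```python
import random


def perturb_tokens(tokens, noise_rate, vocab, rng=None):
    """Replace each token with a random vocabulary entry with probability
    `noise_rate` (hypothetical reading of the caption's "noise rate")."""
    rng = rng or random.Random()
    return [
        rng.choice(vocab) if rng.random() < noise_rate else tok
        for tok in tokens
    ]


# Usage: corrupt roughly 30% of a textual observation, with a fixed seed.
observation = "you are in the kitchen and see a mug on the table".split()
toy_vocab = ["lamp", "drawer", "open", "apple", "sink"]  # assumed vocabulary
noisy = perturb_tokens(observation, 0.3, toy_vocab, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible across evaluation runs, which matters when comparing agents at the same noise rate.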