EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM
Shuang Ao, Flora D. Salim, Simon Khan
TL;DR
EMAC+ addresses the limits of static, text-centric planning in embodied robotics with a bidirectional LLM–VLM framework: the LLM proposes high-level plans $x_{a,1:N}$, and VLM-driven visual feedback $s_v$ refines them in real time, so that the LLM internalizes environment dynamics. Training is collaborative: imitation learning with Direct Preference Optimization (DPO) aligns the VLM with an LLM expert, while LoRA-based fine-tuning updates the LLM on environment trajectories collected during closed-loop execution. Textual observations are grounded through a PDDL-based text world, and a memory of past trajectories supports replanning, which couples planning and control in a single loop. On ALFWorld and RT-1, the paper reports superior task success, robustness to visual noise, and data-efficient learning, and its ablations identify LLM replanning and bidirectional feedback as pivotal to that robustness. Overall, EMAC+ advances adaptive, domain-aware embodied planning by learning environment dynamics through interactive experience while preserving the interpretability of language-guided reasoning.
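To make the DPO step above concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, the argument names, and the default `beta` are illustrative assumptions, and so is the pairing itself, in which the LLM expert's action plays the "chosen" response and the VLM's own action plays the "rejected" one. The summary does not specify how EMAC+ constructs its pairs.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift from the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    Assumption for illustration: 'chosen' is the LLM expert's action and
    'rejected' is the VLM's own action in the same state.
    """
    # Log-ratio of the trained policy to the frozen reference, per response.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # The loss shrinks as the policy favours the chosen response more than the reference does.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference, both margins are zero and the loss is log 2; it falls below that value once the policy assigns relatively more probability to the chosen action than the reference does.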
Abstract
Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their application to robotic control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics and results in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which hinders their ability to learn better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates an LLM and a VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by the LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on the ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
