Egocentric Vision Language Planning
Zhirui Fang, Ming Yang, Weishuai Zeng, Boyu Li, Junpeng Yue, Ziluo Ding, Xiu Li, Zongqing Lu
TL;DR
This work addresses the challenge of grounding large multimodal models for embodied agents operating from an egocentric viewpoint in household environments. It introduces EgoPlan, which couples a diffusion-based one-step dynamics model $p_{\theta}(x_{t+1}|x_t,a_t)$ with an LMM planner that decomposes goals into subgoals and selects actions by comparing predicted outcomes to subgoals. To achieve cross-environment generalization, it integrates motion-aware conditioning via optical flow using a lightweight $f_{t,t+1}$ predictor and ControlNet, plus LoRA-based style transfer for appearance adaptation. Experiments on VH-1.5M/VirtualHome and cross-environment evaluation in Habitat2.0 show improved long-horizon task success and promising generalization, though the approach remains limited to encapsulated skills without low-level control.
Abstract
We explore leveraging large multi-modal models (LMMs) and text-to-image models to build a more general embodied agent. LMMs excel at planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. We propose a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. EgoPlan uses a diffusion model to simulate the fundamental dynamics between states and actions, and integrates style transfer and optical flow to improve generalization across different environmental dynamics. The LMM serves as a planner: it breaks instructions down into sub-goals and selects actions based on how well their predicted outcomes align with these sub-goals, enabling more generalized and effective decision-making. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.
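The planning loop described above (decompose into sub-goals, predict the outcome of each candidate action, pick the action whose prediction best aligns with the current sub-goal) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`select_action`, `run_subgoals`, `predict_next`, `alignment`, `reached`) is hypothetical, the diffusion dynamics model and the LMM scorer are replaced by toy functions on an integer "state", and the toy environment is assumed to behave exactly like the model's prediction.

```python
from typing import Callable, List, Sequence, Tuple


def select_action(
    state: int,
    subgoal: int,
    actions: Sequence[str],
    predict_next: Callable[[int, str], int],
    alignment: Callable[[int, int], float],
) -> str:
    """Pick the action whose predicted next state best aligns with the sub-goal."""
    if not actions:
        raise ValueError("need at least one candidate action")
    return max(actions, key=lambda a: alignment(predict_next(state, a), subgoal))


def run_subgoals(
    state: int,
    subgoals: Sequence[int],
    actions: Sequence[str],
    predict_next: Callable[[int, str], int],
    alignment: Callable[[int, int], float],
    reached: Callable[[int, int], bool],
    max_steps: int = 50,
) -> Tuple[int, List[str]]:
    """Work through the sub-goals in order, one selected action per step."""
    trace: List[str] = []
    for subgoal in subgoals:
        for _ in range(max_steps):
            if reached(state, subgoal):
                break
            action = select_action(state, subgoal, actions, predict_next, alignment)
            trace.append(action)
            # Toy assumption: the environment matches the prediction exactly.
            # A real agent would execute the action and observe the new state.
            state = predict_next(state, action)
    return state, trace


# Toy stand-ins (assumptions): a 1-D position instead of an egocentric image.
MOVES = {"left": -1, "stay": 0, "right": 1}


def toy_predict(state: int, action: str) -> int:
    """Stand-in for the one-step dynamics model p(x_{t+1} | x_t, a_t)."""
    return state + MOVES[action]


def toy_alignment(predicted: int, subgoal: int) -> float:
    """Stand-in for the planner's prediction-vs-sub-goal comparison."""
    return -abs(predicted - subgoal)


def toy_reached(state: int, subgoal: int) -> bool:
    return state == subgoal


final_state, trace = run_subgoals(
    0, [2, -1], list(MOVES), toy_predict, toy_alignment, toy_reached
)
print(final_state, trace)
```

In this sketch the selection rule is a plain argmax over candidate actions; how the actual system compares predicted images to sub-goals is not specified in the text above, so the scoring function is left as a pluggable callable.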
