Egocentric Vision Language Planning

Zhirui Fang, Ming Yang, Weishuai Zeng, Boyu Li, Junpeng Yue, Ziluo Ding, Xiu Li, Zongqing Lu

TL;DR

This work addresses the challenge of grounding large multimodal models for embodied agents operating from an egocentric viewpoint in household environments. It introduces EgoPlan, which couples a diffusion-based one-step dynamics model $p_{\theta}(x_{t+1} \mid x_t, a_t)$ with an LMM planner that decomposes goals into subgoals and selects actions by comparing predicted outcomes to subgoals. To achieve cross-environment generalization, it integrates motion-aware conditioning via optical flow using a lightweight $f_{t,t+1}$ predictor and ControlNet, plus LoRA-based style transfer for appearance adaptation. Experiments on VH-1.5M/VirtualHome and cross-environment evaluation in Habitat2.0 show improved long-horizon task success and promising generalization, though the approach remains limited to encapsulated skills without low-level control.

Abstract

We explore leveraging large multi-modal models (LMMs) and text-to-image models to build a more general embodied agent. LMMs excel at planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. We propose egocentric vision language planning (EgoPlan), an approach for handling long-horizon tasks from an egocentric perspective in varying household scenarios. EgoPlan uses a diffusion model to simulate the fundamental dynamics between states and actions, and integrates style transfer and optical flow to improve generalization across different environmental dynamics. The LMM serves as a planner: it breaks instructions down into sub-goals and selects actions based on how well their predicted outcomes align with these sub-goals, enabling more generalized and effective decision-making. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.
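
The action-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: `select_action`, `predict_next`, and `score` are hypothetical names, and the toy dynamics and scoring functions below stand in for the diffusion-based dynamics model and the LMM's comparison of predicted observations against a sub-goal.

```python
from typing import Callable, Sequence, TypeVar

Obs = TypeVar("Obs")        # observation type (an image in the paper; an int in the toy below)
Action = TypeVar("Action")  # action type (a text action in the paper; a string here)


def select_action(
    obs: Obs,
    subgoal: Obs,
    actions: Sequence[Action],
    predict_next: Callable[[Obs, Action], Obs],
    score: Callable[[Obs, Obs], float],
) -> Action:
    """Pick the action whose predicted next observation best matches the sub-goal.

    predict_next is a stand-in for a one-step dynamics model (assumption);
    score is a stand-in for the planner's alignment judgment (assumption).
    """
    if not actions:
        raise ValueError("need at least one candidate action")
    best_action = actions[0]
    best_score = float("-inf")
    for action in actions:
        predicted = predict_next(obs, action)  # imagine the outcome of this action
        s = score(predicted, subgoal)          # compare the imagined outcome to the sub-goal
        if s > best_score:
            best_action, best_score = action, s
    return best_action


# Toy stand-ins: the "observation" is a position on a line.
def toy_dynamics(pos: int, action: str) -> int:
    return pos + {"left": -1, "stay": 0, "right": 1}[action]


def toy_score(predicted: int, subgoal: int) -> float:
    return -abs(predicted - subgoal)  # closer to the sub-goal scores higher


chosen = select_action(2, 5, ["left", "stay", "right"], toy_dynamics, toy_score)
print(chosen)
```

In this toy setting the agent starts at position 2 with a sub-goal at position 5, so moving right yields the best-scoring predicted outcome. Swapping in a learned dynamics model and a learned or LMM-based scorer leaves the selection loop unchanged.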

Paper Structure (24 sections, 3 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: An illustration sample in VH-1.5M, which includes current image observation, next image observation given the text action, semantic segmentation map, depth map, and optical flow map.
  • Figure 2: Overview of EgoPlan. The left side features a one-step planner that provides the agent with decision-making capabilities, while the right side includes a world model (dynamics model) that provides the agent with an understanding of the current environment.
  • Figure 3: Examples of the generated image of the next observation in VirtualHome. The tasks from rows 1 to 4 are: close the fridge, switch off the light, turn left, and turn right.
  • Figure 4: The success rate on 12 tasks for all the methods. Note that tasks 1-6 occur inside one room, while tasks 7-12 take place in two rooms.
  • Figure 5: Examples of the generated image subgoals. The first row is the original image, and the second row is the image subgoal generated based on the text subgoal.
  • ...and 11 more figures