Egocentric Vision Language Planning

Zhirui Fang; Ming Yang; Weishuai Zeng; Boyu Li; Junpeng Yue; Ziluo Ding; Xiu Li; Zongqing Lu

TL;DR

This work addresses the challenge of grounding large multimodal models for embodied agents operating from an egocentric viewpoint in household environments. It introduces EgoPlan, which couples a diffusion-based one-step dynamics model $p_{\theta}(x_{t+1}|x_t,a_t)$ with an LMM planner that decomposes goals into subgoals and selects actions by comparing predicted outcomes to subgoals. To achieve cross-environment generalization, it integrates motion-aware conditioning via optical flow using a lightweight $f_{t,t+1}$ predictor and ControlNet, plus LoRA-based style transfer for appearance adaptation. Experiments on VH-1.5M/VirtualHome and cross-environment evaluation in Habitat2.0 show improved long-horizon task success and promising generalization, though the approach remains limited to encapsulated skills without low-level control.

Abstract

We explore leveraging large multi-modal models (LMMs) and text-to-image models to build a more general embodied agent. LMMs excel at planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is therefore needed to connect LMMs to the physical world. We propose a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. EgoPlan uses a diffusion model to simulate the fundamental dynamics between states and actions, and integrates style transfer and optical flow to improve generalization across different environmental dynamics. The LMM serves as a planner: it breaks instructions down into sub-goals and selects actions based on how well their predicted outcomes align with those sub-goals, enabling more general and effective decision-making. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.
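
The planning loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_action`, `rollout`), the `reached` check, the `max_steps` budget, and the integer toy world are all assumptions made for illustration. In EgoPlan the predictor would be the diffusion-based dynamics model and the scoring would be done by the LMM; here both are stand-in callables.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # observation type (an image in EgoPlan; an int in the toy below)
G = TypeVar("G")  # sub-goal type


def select_action(state: S, subgoal: G, actions: Sequence[str],
                  predict_next: Callable[[S, str], S],
                  score: Callable[[S, G], float]) -> str:
    """Pick the action whose predicted next observation best matches the sub-goal."""
    if not actions:
        raise ValueError("actions must be non-empty")
    return max(actions, key=lambda a: score(predict_next(state, a), subgoal))


def rollout(state: S, subgoals: Sequence[G], actions: Sequence[str],
            predict_next: Callable[[S, str], S],
            score: Callable[[S, G], float],
            reached: Callable[[S, G], bool],
            max_steps: int = 20) -> List[str]:
    """Work through sub-goals in order, choosing one action per step.

    Assumption: the state is advanced with the model's own prediction.
    A real agent would execute the action and observe the environment instead.
    """
    plan: List[str] = []
    for subgoal in subgoals:
        steps = 0  # hypothetical per-sub-goal step budget
        while not reached(state, subgoal) and steps < max_steps:
            action = select_action(state, subgoal, actions, predict_next, score)
            state = predict_next(state, action)
            plan.append(action)
            steps += 1
    return plan


# Toy stand-ins (hypothetical): the "observation" is a position on a line.
def toy_predict(pos: int, action: str) -> int:
    return pos + 1 if action == "right" else pos - 1


def toy_score(pred: int, goal: int) -> float:
    return -abs(pred - goal)  # closer to the sub-goal scores higher


def toy_reached(pos: int, goal: int) -> bool:
    return pos == goal


if __name__ == "__main__":
    print(rollout(0, [2, 1], ["left", "right"], toy_predict, toy_score, toy_reached))
```

The design point the sketch tries to capture is that the planner never needs to localize objects itself: it only compares each predicted outcome against the current sub-goal and keeps the best-scoring action.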

Paper Structure (24 sections, 3 equations, 16 figures, 4 tables)

Introduction
Related Work
Diffusion Model
World Model for Decision-making
VH-1.5M Dataset
Method
Diffusion-Based Dynamics Model
Learning Dynamics
Generalization
Planning with Dynamics Model
Goal Decomposition
One-Step Planner
Experiment
Visual Quality
VirtualHome Tasks
...and 9 more sections

Figures (16)

Figure 1: An illustration sample in VH-1.5M, which includes current image observation, next image observation given the text action, semantic segmentation map, depth map, and optical flow map.
Figure 2: Overview of EgoPlan. The left side features a one-step planner that provides the agent with decision-making capabilities, while the right side includes a world model (dynamics model) that provides the agent with an understanding of the current environment.
Figure 3: Examples of the generated image of the next observation in VirtualHome. The tasks from rows 1 to 4 are: close the fridge, switch off the light, turn left, and turn right.
Figure 4: The success rate on 12 tasks for all the methods. Note that tasks 1-6 occur inside one room, while tasks 7-12 take place in two rooms.
Figure 5: Examples of the generated image subgoals. The first row is the original image, and the second row is the image subgoal generated based on the text subgoal.
...and 11 more figures
