ARDuP: Active Region Video Diffusion for Universal Policies

Shuaiyi Huang; Mara Levy; Zhenyu Jiang; Anima Anandkumar; Yuke Zhu; Linxi Fan; De-An Huang; Abhinav Shrivastava

TL;DR

ARDuP conditions video-based universal policy learning on active regions, i.e., the areas where interaction is likely to occur. It decomposes planning into a latent active-region generator and a latent video planner, uses a latent inverse dynamics model to recover actions, and derives pseudo-active regions automatically from motion cues and segmentation, without manual labeling. Experiments on the CLIPort simulator and the real-world BridgeData v2 dataset show improved success rates and more realistic generated video plans. The results highlight the value of focusing video-conditioned policies on interaction regions across tasks and environments.

Abstract

Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e., potential interaction areas, enhancing the conditional policy's focus on the interactive areas critical for task execution. The framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamics modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on the CLIPort simulator and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.
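The data flow described in the abstract can be sketched as a composition of three stages. This is a minimal illustration, not the authors' implementation: the function name `plan_actions`, the three callables, and the choice to decode one action per pair of consecutive latents are all assumptions made for the sketch.

```python
from typing import Any, Callable, List, Sequence


def plan_actions(
    text: str,
    first_latent: Any,
    region_generator: Callable[[str, Any], Any],
    video_planner: Callable[[str, Any, Any], Sequence[Any]],
    inverse_dynamics: Callable[[Any, Any], Any],
) -> List[Any]:
    """Hypothetical three-stage pipeline: region -> video latents -> actions."""
    # Stage 1: generate a latent active region from the goal text and
    # the latent of the initial frame.
    region_latent = region_generator(text, first_latent)

    # Stage 2: generate a sequence of future frame latents, conditioned
    # on the text, the initial latent, and the active region.
    video_latents = video_planner(text, first_latent, region_latent)

    # Stage 3 (assumption): decode one action from each pair of
    # consecutive latents, staying in latent space throughout.
    return [
        inverse_dynamics(video_latents[t], video_latents[t + 1])
        for t in range(len(video_latents) - 1)
    ]
```

In practice each callable would wrap a trained model; the sketch only fixes the order in which the stages are called and what each stage is conditioned on.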

Paper Structure (24 sections, 4 equations, 6 figures, 2 tables)

This paper contains 24 sections, 4 equations, 6 figures, 2 tables.

Introduction
Related Work
Video Diffusion Models for Decision Making
Active Region for Vision and Robotics
Problem Formulation
Preliminaries
Latent Unified Predictive Decision Process conditioned on Active Region (LUPDP-AR)
Active Region Video Diffusion for Universal Policies (ARDuP)
Active Region Generator
Active Region Supervision from Videos
Latent Active Region Diffusion
Universal Video Planner conditioned on Active Region
Task Specific Action Decoding from Latent Sequence
Latent Inverse Dynamics Model
Action Execution
...and 9 more sections

Figures (6)

Figure 1: Given the task text and an initial frame, we aim to generate a video that serves as a plan. With active region conditioning (bottom left), our method ARDuP generates frames in which the robot arm successfully picks up the white block (bottom). Without active region input, the arm incorrectly targets a purple block (top). This shows ARDuP's effectiveness in producing task-aligned video sequences.
Figure 2: Overview of our Active Region Video Diffusion for Universal Policies (ARDuP). Starting with a video, we use Co-Tracker to identify moving points in the initial frame, which SAM then uses to generate pseudo masks of active regions. These pseudo masks delineate the pseudo active region and serve as supervision for training our Latent Active Region Diffusion Model. The generated latent active region conditions the Latent Video Diffusion Model, which synthesizes a video latent sequence. Finally, the Latent Inverse Dynamics Model decodes the generated latent sequence into a corresponding action sequence.
Figure 3: Qualitative comparison of the generated video plans on CLIPort (Shridhar et al., 2022) unseen test tasks. Our method successfully generates a video that packs the white block, while the counterpart without the active region generates frames that pick the wrong object (the red block) or fails to generate a correct arm position for grasping objects. We demonstrate higher visual quality, especially around objects of interest, thanks to the guidance provided by our active regions.
Figure 4: Qualitative comparison of the generated video plans on BridgeData v2 (Walke et al., 2023). Our method successfully picks up the correct object (sushi) or places the object in the appropriate location (in the cup or colander), while the UniPi* baseline selects the wrong object (cucumber) or places it incorrectly (on the desk), showing the advantage of our method.
Figure 5: Task loss on the CLIPort (Shridhar et al., 2022) test set. Our method (green) shows significantly lower task loss than the variant without the active region (red), underscoring the active region's role in enhancing generation quality.
...and 1 more figure
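The Figure 2 caption describes selecting moving points in the initial frame before segmentation. The sketch below illustrates one way that selection could work, assuming point tracks have already been produced by a tracker. The function name `select_moving_points`, the maximum-displacement criterion, and the `min_displacement` threshold are assumptions for illustration; the summary here does not state the exact rule used in the paper.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def select_moving_points(
    tracks: Sequence[Sequence[Point]],
    min_displacement: float = 2.0,  # assumed threshold, in pixels
) -> List[Point]:
    """Return initial-frame coordinates of points that move during the video.

    tracks[t][n] is the (x, y) position of tracked point n at frame t.
    A point counts as moving if its largest distance from its initial
    position, over all frames, reaches min_displacement.
    """
    if not tracks:
        return []

    moving = []
    for n, (x0, y0) in enumerate(tracks[0]):
        # Largest displacement of point n relative to the initial frame.
        max_disp = max(
            math.hypot(frame[n][0] - x0, frame[n][1] - y0) for frame in tracks
        )
        if max_disp >= min_displacement:
            moving.append((x0, y0))
    return moving
```

The returned coordinates could then be passed as point prompts to a segmentation model to obtain a pseudo mask of the active region, which is the role SAM plays in the caption.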
