This&That: Language-Gesture Controlled Video Generation for Robot Planning

Boyang Wang; Nikhil Sridhar; Chao Feng; Mark Van der Merwe; Adam Fishman; Nima Fazeli; Jeong Joon Park

TL;DR

This&That aims to let robots understand and act on simple language-gesture instructions by coupling a language-gesture conditioned video diffusion model with a video-based behavioral cloning (BC) policy, Diffusion Video to Action (DiVA). Conditioning video generation on both text and 2D gesture cues yields action-planning videos that better reflect user intent; a Transformer-based BC model then translates these videos into robot actions. On the real-robot Bridge datasets and in simulated Isaac Gym rollouts, the approach reports higher video quality, closer alignment with user intent, and higher task success than the compared baselines, particularly in ambiguous scenes where language alone falls short. The work treats video predictions as intermediate plans that can be mapped to manipulation actions, a practical route toward multi-task human-robot collaboration, and it leaves real-world transfer and longer-horizon tasks as future extensions.

Abstract

Clear, interpretable instructions are invaluable when attempting any complex task. Good instructions help to clarify the task and even anticipate the steps needed to solve it. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. We tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intent, and 3) translating visual plans into robot actions. This&That uses language-gesture conditioning to generate video predictions, as a succinct and unambiguous alternative to existing language-only methods, especially in complex and uncertain environments. These video predictions are then fed into a behavior cloning architecture dubbed Diffusion Video to Action (DiVA), which outperforms prior state-of-the-art behavior cloning and video-based planning methods by substantial margins.
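To make the idea of a language-gesture instruction concrete, here is a minimal sketch of how a short deictic prompt and its 2D gesture points could be packaged as a conditioning input. Everything in it is an assumption made for illustration: the `Instruction` class, the `rasterize_gestures` helper, the (row, col) pixel convention, and the choice to render each gesture as a small square in its own binary map are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instruction:
    """Hypothetical container pairing a text prompt with 2D gesture points."""
    text: str                        # e.g. a deictic prompt such as "put this there"
    gestures: List[Tuple[int, int]]  # (row, col) pixel locations; assumed convention


def rasterize_gestures(height: int, width: int,
                       gestures: List[Tuple[int, int]],
                       radius: int = 1) -> List[List[List[float]]]:
    """Return one height x width binary map per gesture point.

    Each map is 1.0 inside a square of side 2 * radius + 1 centred on the
    gesture (clipped at the image border) and 0.0 elsewhere.
    """
    maps = []
    for r0, c0 in gestures:
        grid = [[0.0] * width for _ in range(height)]
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                grid[r][c] = 1.0
        maps.append(grid)
    return maps


# Illustrative usage: two gestures ("this" and "there") on an 8 x 10 frame.
instr = Instruction(text="put this there", gestures=[(2, 3), (6, 7)])
gesture_maps = rasterize_gestures(8, 10, instr.gestures, radius=1)
```

Keeping one map per gesture preserves which point is which, something a single merged map would lose; whether the actual model distinguishes the points this way is not stated in the text above.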

Paper Structure (40 sections, 2 equations, 13 figures, 5 tables)

Introduction
Related Work
Overview
Language-Gesture Conditioned Video Diffusion Models
Language-Conditioned Finetuning
Gesture-Conditioned Training and Inference
Video-Conditioned Behavioral Cloning
Experiments
Video Generation Experiments and Comparisons
Synthetic Rollout Experiments
Limitations
Conclusion
Document Overview
Additional Experiments and Ablation Studies
Qualitative Comparison with Contemporary Video Generative Models
...and 25 more sections

Figures (13)

Figure 1: Video generation for robot planning. Using the same initial frame, our video diffusion model can effectively generate various action sequences, each conditioned on different pairs of gestures and text prompts. Our approach accommodates simple deictic language such as "this" and "that". Our gesture conditioning proves critical for precise video control.
Figure 2: Video Diffusion Model Architecture. Our video diffusion model is conditioned on the first-frame image and on language-gesture inputs.
Figure 3: Video-based Planning Qualitative Results. We present three examples to compare This&That with AVDC (Ko et al., 2023). The gesture locations are overlaid on the leftmost frame. Our method can generate action sequences effectively with higher visual quality, even when using deictic words.
Figure 4: Video-conditioned Behavior Cloning Architecture. Our Diffusion Video to Action (DiVA) model utilizes a Transformer encoder-decoder architecture to convert video plans into executable robot actions by compressing image embeddings using TokenLearner (Ryoo et al., 2021) and referencing video plan tokens with cross-attention.
Figure 5: Simulation Rollout Qualitative Comparison. We compare the simulated rollouts of our language-gesture conditioned model against AVDC. AVDC struggles to interpret complex text instructions and resolve scene ambiguities. In contrast, our model effectively translates user intent into actions, even with simple language commands.
...and 8 more figures
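The Figure 4 caption names two mechanisms: compressing image embeddings into fewer tokens, and attending from those tokens to video plan tokens. The sketch below illustrates both with the standard library only. It is a simplification under stated assumptions, not the DiVA implementation: the fixed `selectors` stand in for weights that TokenLearner would learn, the pooling uses a plain softmax over dot-product scores, and the attention omits learned projections and multiple heads. All function and variable names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights: Vector, rows: Matrix) -> Vector:
    """Combine rows into one vector using the given per-row weights."""
    dim = len(rows[0])
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(dim)]


def compress_tokens(tokens: Matrix, selectors: Matrix) -> Matrix:
    """TokenLearner-style pooling (simplified): each selector scores every
    input token, and the softmax-weighted sum becomes one output token."""
    return [weighted_sum(softmax([dot(sel, t) for t in tokens]), tokens)
            for sel in selectors]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    return [weighted_sum(softmax([dot(q, k) / scale for k in keys]), values)
            for q in queries]


# Illustrative usage with toy 2-dimensional tokens.
observation_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
selectors = [[1.0, 0.0], [0.0, 1.0]]       # placeholders for learned weights
compressed = compress_tokens(observation_tokens, selectors)   # 4 tokens -> 2
video_plan_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cross_attention(compressed, video_plan_tokens, video_plan_tokens)
```

The compression step reduces the number of tokens the attention has to process, which is the usual motivation for placing a token-pooling stage before a Transformer; the actual token counts and dimensions used by DiVA are not given in the text above.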
