Image Generation as a Visual Planner for Robotic Manipulation

Ye Pang

TL;DR

This paper addresses the challenge of unifying perception, planning, and action in robotics by adapting a pretrained image generation model into a visual planner that renders short manipulation sequences as 3×3 grids of frames. It introduces a two-branch framework in which LoRA adapters condition generation on either a natural language instruction or a 2D end-effector trajectory, each anchored to the first observed frame. In experiments on Jaco Play, Bridge V2, and RT-1, the approach yields smooth, coherent videos that follow the given semantic or spatial cues and often outperforms task-specific baselines in perceptual quality and fidelity. The results suggest that high-capacity image generators encode transferable temporal priors and can serve as efficient visual planners for robotics, enabling video-like planning with minimal supervision and limited data.

Abstract

Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play, Bridge V2, and RT-1 datasets show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation.
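To make the grid-image idea concrete, the following is a minimal sketch of how a short clip might be packed into a single 3×3 grid image, how a generated grid might be cut back into frames, and how a 2D trajectory might be drawn onto the first frame as a conditioning input. It is not the authors' implementation: the function names `tile_grid`, `split_grid`, and `overlay_trajectory`, the row-major frame ordering, and the single-pixel trajectory markers are all assumptions made for illustration, and the LoRA-adapted generator itself is not shown.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # frame[y][x]; all frames share the same height and width


def tile_grid(frames: Sequence[Frame], rows: int = 3, cols: int = 3) -> Frame:
    """Pack rows*cols equally sized frames into one image (assumed row-major in time)."""
    if len(frames) != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {len(frames)}")
    height = len(frames[0])
    grid: Frame = []
    for r in range(rows):
        for y in range(height):
            line: List[Pixel] = []
            for c in range(cols):
                # Append pixel row y of the frame that sits in grid cell (r, c).
                line.extend(frames[r * cols + c][y])
            grid.append(line)
    return grid


def split_grid(grid: Frame, rows: int = 3, cols: int = 3) -> List[Frame]:
    """Inverse of tile_grid: cut a grid image back into an ordered list of frames."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    return [
        [row[c * width:(c + 1) * width] for row in grid[r * height:(r + 1) * height]]
        for r in range(rows)
        for c in range(cols)
    ]


def overlay_trajectory(frame: Frame, points: Sequence[Tuple[int, int]],
                       color: Pixel = (255, 0, 0)) -> Frame:
    """Return a copy of frame with each in-bounds (x, y) trajectory point recolored."""
    out = [list(row) for row in frame]
    for x, y in points:
        if 0 <= y < len(out) and 0 <= x < len(out[0]):
            out[y][x] = color
    return out
```

Under these assumptions, `tile_grid` would build a training target from nine sampled frames, `split_grid` would recover a frame sequence from a generated grid, and `overlay_trajectory` would produce the conditioning image for the trajectory branch from the initial frame.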
