Image Generation as a Visual Planner for Robotic Manipulation

Ye Pang

TL;DR

This paper tackles the challenge of unifying perception, planning, and action in robotics by turning a pretrained image generator (a diffusion transformer) into a visual planner that generates short manipulation sequences as 3×3 grid images. It introduces a two-branch framework that uses LoRA adapters to condition on either a natural language instruction or a 2D end-effector trajectory, both anchored to the first observed frame. In experiments on Jaco Play, Bridge V2, and RT-1, the approach yields smooth, coherent videos that follow semantic or spatial cues and often outperforms task-specific baselines in perceptual quality and fidelity. The results suggest that high-capacity image generators encode transferable temporal priors and can serve as efficient visual planners for robotics, enabling video-like planning with minimal supervision and data requirements.

Abstract

Generating realistic robotic manipulation videos is an important step toward unifying perception, planning, and action in embodied agents. While existing video diffusion models require large domain-specific datasets and struggle to generalize, recent image generation models trained on language-image corpora exhibit strong compositionality, including the ability to synthesize temporally coherent grid images. This suggests a latent capacity for video-like generation even without explicit temporal modeling. We explore whether such models can serve as visual planners for robots when lightly adapted using LoRA finetuning. We propose a two-part framework that includes: (1) text-conditioned generation, which uses a language instruction and the first frame, and (2) trajectory-conditioned generation, which uses a 2D trajectory overlay and the same initial frame. Experiments on the Jaco Play dataset, Bridge V2, and the RT1 dataset show that both modes produce smooth, coherent robot videos aligned with their respective conditions. Our findings indicate that pretrained image generators encode transferable temporal priors and can function as video-like robotic planners under minimal supervision. Code is released at https://github.com/pangye202264690373/Image-Generation-as-a-Visual-Planner-for-Robotic-Manipulation.

Paper Structure

This paper contains 34 sections, 17 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of our work. We convert a pretrained image generator into a visual planner that synthesizes $3{\times}3$ manipulation grids under either text or trajectory conditioning.
  • Figure 2: Overall framework of our approach. Our method adapts a pretrained image generator (DiT backbone with LoRA adapters) into a controllable video-like synthesizer that outputs a $3{\times}3$ grid image representing a short manipulation sequence. The upper branch shows the text-conditioned generation, where a language instruction and the first observed frame are encoded by the Text Encoder and VAE respectively. The lower branch shows the trajectory-conditioned generation, where a 2D path is rendered over the first frame and encoded similarly. Both branches share the same DiT architecture with LoRA applied to attention projections. On the right, the data synthesis pipeline illustrates how robot videos are processed: frame sampling, grid assembly, and masking for conditional supervision. See Figure 3 for details.
  • Figure 3: Data Synthesis Pipeline. From each robot video, nine frames are uniformly sampled and arranged into a $3{\times}3$ grid following a serpentine temporal order ($1{\rightarrow}2{\rightarrow}3$, $6{\leftarrow}5{\leftarrow}4$, $7{\rightarrow}8{\rightarrow}9$). Only the top-left cell remains visible as the conditioning frame, while the other cells are masked to zero. For the trajectory-conditioned variant, a 2D end-effector path is overlaid on the first frame (red$\rightarrow$blue indicating temporal progression). The resulting masked grid serves as the model input, and the complete grid as the reconstruction target for supervised training.
  • Figure 4: Qualitative comparisons between text-conditioned and trajectory-conditioned generation. Each row shows a 9-frame sequence arranged in temporal order. The top example (put the carrot on the cloth) and bottom example (take bowl off plate) illustrate how both conditioning strategies interpret the same initial frame differently. The text-conditioned model relies solely on semantic understanding of the prompt to identify the correct object and its intended motion, while the trajectory-conditioned model follows the spatial path provided by the overlaid end-effector trace.
  • Figure 5: Ablation on the instruction “place the pot behind the green pear.” Top: full models with LoRA and correct conditioning. Bottom: removing LoRA or conditioning causes incoherent or aimless motions, while full models complete the instructed placement.
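
The grid construction described in the Figure 3 caption (uniform sampling of nine frames, serpentine placement, masking of all cells except the top-left) can be sketched roughly as follows. This is a minimal illustration using NumPy, not the authors' released code: the function names (`sample_indices`, `serpentine_cell`, `assemble_grid`, `mask_grid`) are hypothetical, and the choice to include both the first and last video frame when sampling is an assumption the caption does not confirm. The trajectory overlay is left out.

```python
import numpy as np

GRID = 3  # 3x3 layout, i.e. nine frames per grid image


def sample_indices(num_frames: int, k: int = GRID * GRID) -> list[int]:
    """Pick k frame indices spread evenly over a video.

    Assumption: both endpoints (first and last frame) are included.
    """
    return np.linspace(0, num_frames - 1, k).round().astype(int).tolist()


def serpentine_cell(k: int) -> tuple[int, int]:
    """Map a 0-based temporal index to (row, col) in serpentine order.

    Even rows run left to right, odd rows run right to left, which gives
    1->2->3 / 6<-5<-4 / 7->8->9 for 1-based frame numbers.
    """
    row, pos = divmod(k, GRID)
    col = pos if row % 2 == 0 else GRID - 1 - pos
    return row, col


def assemble_grid(frames: np.ndarray) -> np.ndarray:
    """Tile nine (H, W, C) frames into a single (3H, 3W, C) image."""
    n, h, w, c = frames.shape
    if n != GRID * GRID:
        raise ValueError(f"expected {GRID * GRID} frames, got {n}")
    grid = np.zeros((GRID * h, GRID * w, c), dtype=frames.dtype)
    for k in range(n):
        row, col = serpentine_cell(k)
        grid[row * h:(row + 1) * h, col * w:(col + 1) * w] = frames[k]
    return grid


def mask_grid(grid: np.ndarray, h: int, w: int) -> np.ndarray:
    """Keep only the top-left cell (the conditioning frame); zero the rest."""
    masked = np.zeros_like(grid)
    masked[:h, :w] = grid[:h, :w]
    return masked


# Toy usage: nine 2x2 RGB "frames", each filled with its 1-based frame number.
frames = np.stack([np.full((2, 2, 3), k + 1, dtype=np.uint8) for k in range(9)])
target = assemble_grid(frames)         # full grid: reconstruction target
model_input = mask_grid(target, 2, 2)  # masked grid: model input
```

In this sketch the masked grid and the full grid form one (input, target) training pair, matching the caption's description of supervised reconstruction.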