Generative Image as Action Models

Mohit Shridhar, Yat Long Lo, Stephen James

TL;DR

Genima recasts action generation for robotics as an image-generation task: it fine-tunes Stable Diffusion with ControlNet to draw joint-action targets on RGB observations, then uses a Transformer-based controller to translate those targets into a sequence of joint positions. Across 25 RLBench tasks and 9 real-world tasks, it outperforms several visuomotor baselines and is robust to color, lighting, and distractor perturbations, while approaching the performance of 3D next-best-pose methods despite lacking depth or motion-planners. The approach highlights the potential of internet-pretrained diffusion models for visuomotor control and suggests that multi-view tiling and careful conditioning enable practical, robust policy learning. Limitations include reliance on camera calibration, slower diffusion-based target generation, and the inherent limitations of behavior cloning, which motivate future RL or safety-aware enhancements.

Abstract

Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
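
To make "drawing joint-actions as targets on RGB images" concrete, here is a minimal sketch of how such a target image could be constructed from known joint positions. It is an illustration under stated assumptions, not the paper's rendering code: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the helper names, the color palette, and the use of flat discs in place of rendered spheres are all hypothetical.

```python
import numpy as np


def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    points_cam = np.asarray(points_cam, dtype=float)
    z = points_cam[:, 2]
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)


def draw_joint_targets(image, pixels, colors, radius=4):
    """Return a copy of `image` with one filled, uniquely colored disc per joint."""
    out = image.copy()
    h, w = out.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]  # row (y) and column (x) index grids
    for (u, v), color in zip(pixels, colors):
        mask = (xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2
        out[mask] = color
    return out


# Hypothetical inputs: three joints, a blank 64x64 observation, made-up intrinsics.
JOINT_COLORS = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8)
obs = np.zeros((64, 64, 3), dtype=np.uint8)
joints_cam = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.1, 1.0]]

pixels = project_points(joints_cam, fx=100.0, fy=100.0, cx=32.0, cy=32.0)
target_image = draw_joint_targets(obs, pixels, JOINT_COLORS)
```

In this sketch the targets are computed from ground-truth joint positions, as one would when building training images. In GENIMA, the fine-tuned diffusion model generates the target image at inference time, and a separate controller maps it to joint positions.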

Paper Structure

This paper contains 60 sections, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Genima Overview. Genima is a behavior-cloning agent that maps multi-view RGB observations and language goals to joint-position actions. It is composed of two stages: (1) SD-Turbo [sdturbo] is fine-tuned with ControlNet [controlnet] to draw target joint-positions, which are taken from the $t+K$ timestep in expert demonstrations. Each joint is rendered as a uniquely colored sphere. (2) The generated targets are input into an ACT [mtact, zhao2023learning] controller, which translates them into a sequence of $K$ joint-positions. The controller is trained to ignore background context by using random backgrounds (see Figure 2). Both stages are trained independently and used sequentially during inference.
  • Figure 2: During training, the controller's inputs are ground-truth targets with random backgrounds (left). During inference, the targets come from the diffusion agent (right).
  • Figure 3: Ablations and Sensitivity Analyses. We study factors that affect Genima's performance by training a multi-task agent on 3 tasks: take lid off, open box, and slide block. We report average success rates across the 3 tasks.
  • Figure 4: Performance drops from Colosseum [pumacay2024colosseum] perturbations. We evaluate Genima and ACT on 6 perturbation categories: randomized object and part colors, distractor objects, lighting color and brightness variations, randomized table textures, randomized backgrounds, and camera pose changes. We report success rates from 150 evaluation episodes per task, where perturbations are randomly sampled per episode. ACT overfits to objects and lighting conditions, whereas Genima is more robust to such perturbations. See supplementary video for examples.
  • Figure 5: Spatial Generalization. Train and test saucepan positions (from a top-down view of the tabletop) for evaluations on take lid off. ACT struggles to extrapolate to the upper-right region, whereas Genima uses aligned image-action spaces for better spatial generalization.
  • ...and 10 more figures
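
The captions for Figures 1 and 2 say the controller is trained on ground-truth targets placed over random backgrounds so that it learns to ignore scene context. The sketch below shows one way such a training input could be composited. It is only an assumption-laden illustration: the function name, the boolean target mask, and the use of uniform noise as the background are hypothetical stand-ins, and the paper may draw its backgrounds differently.

```python
import numpy as np


def composite_on_random_background(target_rgb, target_mask, rng):
    """Keep target pixels and replace everything else with random RGB noise.

    target_rgb:  (H, W, 3) uint8 image containing the rendered targets.
    target_mask: (H, W) bool array, True where a target was drawn.
    rng:         numpy Generator used to sample the background.
    """
    background = rng.integers(0, 256, size=target_rgb.shape, dtype=np.uint8)
    return np.where(target_mask[..., None], target_rgb, background)


# Hypothetical example: one red 4x4 target on a 32x32 canvas.
rng = np.random.default_rng(0)
target_rgb = np.zeros((32, 32, 3), dtype=np.uint8)
target_mask = np.zeros((32, 32), dtype=bool)
target_rgb[10:14, 10:14] = (255, 0, 0)
target_mask[10:14, 10:14] = True

# Two training samples share the same targets but have different backgrounds.
sample_a = composite_on_random_background(target_rgb, target_mask, rng)
sample_b = composite_on_random_background(target_rgb, target_mask, rng)
```

Because only the target pixels stay constant across samples, a controller trained on such inputs has no consistent background signal to rely on. That matches the motivation the captions give for training with random backgrounds.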