Generative Image as Action Models

Mohit Shridhar; Yat Long Lo; Stephen James

TL;DR

Genima recasts action generation for robotics as an image-generation task: it fine-tunes Stable Diffusion with ControlNet to draw joint-action targets on RGB observations, then uses a Transformer-based controller to map those targets into a sequence of joint positions. Across 25 RLBench tasks and 9 real-world tasks, it outperforms several visuomotor baselines and is robust to color, lighting, and distractor perturbations, while approaching the performance of 3D next-best-pose methods despite lacking depth or motion planners. The approach highlights the potential of internet-pretrained diffusion models to power visuomotor control and suggests that multi-view tiling and careful conditioning enable practical, robust policy learning. Limitations include reliance on camera calibration, slower diffusion-based target generation, and the inherent limitations of behavior cloning, motivating future RL or safety-aware enhancements.

Abstract

Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
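
To make "drawing joint-actions as targets on RGB images" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a calibrated pinhole camera and draws flat colored discs instead of rendered spheres. The names `JOINT_COLORS`, `project_point`, `draw_targets`, and `target_index` are hypothetical, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative placeholders. Targets are taken from the $t+K$ timestep of a demonstration, as described in the Figure 1 caption.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # image[row][col] -> (R, G, B)

# Hypothetical palette: one distinct color per joint.
JOINT_COLORS: List[Pixel] = [
    (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
    (255, 0, 255), (0, 255, 255), (255, 128, 0),
]


def target_index(t: int, k: int, episode_len: int) -> int:
    """Index of the demonstration frame used as the target at timestep t.

    Clamped to the last frame so that t + K never runs past the episode.
    """
    return min(t + k, episode_len - 1)


def project_point(p_cam: Sequence[float], fx: float, fy: float,
                  cx: float, cy: float) -> Tuple[int, int]:
    """Project a 3D point in the camera frame to pixel (u, v), pinhole model."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return round(fx * x / z + cx), round(fy * y / z + cy)


def draw_targets(image: Image, joints_cam: Sequence[Sequence[float]],
                 fx: float, fy: float, cx: float, cy: float,
                 radius: int = 2) -> Image:
    """Return a copy of `image` with one colored disc per joint target."""
    if len(joints_cam) > len(JOINT_COLORS):
        raise ValueError("not enough distinct colors for the given joints")
    out = [row[:] for row in image]  # copy rows; the input is left untouched
    h, w = len(out), len(out[0])
    for color, p in zip(JOINT_COLORS, joints_cam):
        u, v = project_point(p, fx, fy, cx, cy)
        # Fill every in-bounds pixel within `radius` of the projected center.
        for yy in range(max(0, v - radius), min(h, v + radius + 1)):
            for xx in range(max(0, u - radius), min(w, u + radius + 1)):
                if (xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2:
                    out[yy][xx] = color
    return out
```

In this sketch the projection step is where camera calibration enters: with wrong intrinsics or extrinsics, the drawn targets would no longer line up with the robot in the image, which is consistent with the calibration dependence noted in the TL;DR.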

Paper Structure (60 sections, 15 figures, 6 tables)

Introduction
Genima
Diffusion Agent
Controller
Experiments
Visuomotor and 3D Baselines
Semantic and Spatial Generalization
Ablations and Sensitivity Analyses
Real-robot Evaluations
Related Work
Conclusion and Limitations
RLBench Tasks
Basketball in Hoop
Insert USB in Computer
Move Hanger
...and 45 more sections

Figures (15)

Figure 1: Genima Overview. Genima is a behavior-cloning agent that maps multi-view RGB observations and language goals to joint-position actions. Genima is composed of two stages: (1) SD-Turbo is fine-tuned with ControlNet to draw target joint-positions, which are from the $t+K$ timestep in expert demonstrations. Each joint is rendered as a uniquely colored sphere. (2) The generated targets are input into an ACT controller, which translates them into a sequence of $K$ joint-positions. The controller is trained to ignore background context by using random backgrounds (see Figure 2). Both stages are trained independently and used sequentially during inference.
Figure 2: During training, the controller's inputs are ground-truth targets with random backgrounds (left). During inference, the targets are from the diffusion agent (right).
Figure 3: Ablations and Sensitivity Analyses. We study factors that affect Genima's performance by training a multi-task agent on 3 tasks: take lid off, open box, and slide block. We report average success rates across the 3 tasks.
Figure 4: Performance drops from Colosseum perturbations. We evaluate Genima and ACT on 6 perturbation categories: randomized object and part colors, distractor objects, lighting color and brightness variations, randomized table textures, randomized backgrounds, and camera pose changes. We report success rates from 150 evaluation episodes per task, where perturbations are randomly sampled episodically. ACT overfits to objects and lighting conditions, whereas Genima is more robust to such perturbations. See supplementary video for examples.
Figure 5: Spatial Generalization. Train and test saucepan positions (from a top-down view of the tabletop) for evaluations on take lid off. ACT struggles to extrapolate to the upper-right region, whereas Genima uses aligned image-action spaces for better spatial generalization.
...and 10 more figures
