Generative Image as Action Models
Mohit Shridhar, Yat Long Lo, Stephen James
TL;DR
Genima recasts action generation for robotics as an image-generation task: it fine-tunes Stable Diffusion with ControlNet to draw joint-action targets on RGB observations, then uses a Transformer-based controller to map those targets into a sequence of joint positions. Across 25 RLBench tasks and 9 real-world tasks, it outperforms several visuomotor baselines and is robust to color, lighting, and distractor perturbations, while approaching the performance of 3D next-best-pose methods despite lacking depth or motion planners. The approach highlights the potential of internet-pretrained diffusion models for visuomotor control and suggests that multi-view tiling and careful conditioning enable practical, robust policy learning. Limitations include reliance on camera calibration, slow diffusion-based target generation, and the inherent limitations of behavior cloning, motivating future RL or safety-aware enhancements.
Abstract
Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
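To make "lifting actions into image-space" concrete, the following minimal sketch shows one way a joint target could be placed on an RGB frame: a 3D point is projected with a pinhole camera model and marked with a colored disc, and four views are tiled into one image. This is an illustration under stated assumptions, not the authors' rendering code. The function names, the disc-shaped marker, the 2x2 tiling layout, and all numeric values are hypothetical, and the diffusion model and controller are not shown.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point (camera frame) to pixel coordinates (pinhole model).

    Assumes a calibrated camera: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point. These values are placeholders.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def draw_target(image, center_uv, radius, color):
    """Return a copy of `image` with a filled disc drawn at `center_uv`.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The disc marker is an assumption made for illustration.
    """
    u0, v0 = center_uv
    out = [row[:] for row in image]  # copy rows; pixel tuples are immutable
    for v, row in enumerate(out):
        for u in range(len(row)):
            if (u - u0) ** 2 + (v - v0) ** 2 <= radius ** 2:
                row[u] = color
    return out


def tile_2x2(views):
    """Tile four equally sized views into one image (hypothetical layout)."""
    a, b, c, d = views
    top = [ra + rb for ra, rb in zip(a, b)]
    bottom = [rc + rd for rc, rd in zip(c, d)]
    return top + bottom


H = W = 128
rgb = [[(0, 0, 0)] * W for _ in range(H)]  # blank stand-in for an RGB observation

# Hypothetical joint position 1 m in front of the camera.
uv = project_point((0.25, -0.25, 1.0), fx=100.0, fy=100.0, cx=64.0, cy=64.0)
target_img = draw_target(rgb, uv, radius=4, color=(255, 0, 0))
tiled = tile_2x2([target_img] * 4)
```

Because the marker location depends on the camera intrinsics and pose, a sketch like this also shows why such an approach relies on camera calibration: an error in `fx`, `fy`, `cx`, or `cy` shifts where the target appears in the image.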
