S$^2$-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation
Quantao Yang, Michael C. Welle, Danica Kragic, Olov Andersson
TL;DR
Imitation-learning policies for robot manipulation often fail to generalize beyond the specific instances seen in training. This work introduces S$^2$-Diffusion, an open-vocabulary spatial-semantic diffusion policy that fuses Grounded-SAM2 semantic segmentation with DepthAnythingV2 depth prediction to form a spatial-semantic representation, $z = z_f \oplus z_d$, which conditions a diffusion-based visuomotor policy. The method trains a CNN-based diffusion model on RGB inputs and proprioception, using a DDIM-like schedule and the loss $\mathcal{L} = \text{MSE}(a^0, \pi_\theta(a^0+\epsilon^k, o, k))$, so that instance-level demonstrations generalize to category-level skills. Across diverse simulated and real-world tasks, S$^2$-Diffusion outperforms baselines, is robust to background, texture, and object variations, and requires only a single RGB camera. The approach advances practical category-level generalization in robotic manipulation by combining open-vocabulary perception with diffusion-based control.
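As a rough illustration of the two formulas above, the following minimal Python sketch computes the representation $z = z_f \oplus z_d$ and the loss $\mathcal{L}$ for a single sample. It is not the authors' implementation. It assumes that $\oplus$ denotes feature concatenation and that $\epsilon^k$ is zero-mean Gaussian noise whose scale is given by a hypothetical parameter `sigma_k`. The function names and the `identity_policy` stand-in are also hypothetical.

```python
import random


def concat_features(z_f, z_d):
    """Build z from semantic features z_f and depth features z_d.

    Assumption: the fusion operator in the text is plain concatenation.
    """
    return list(z_f) + list(z_d)


def mse(x, y):
    """Mean squared error between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def denoising_loss(policy, a0, obs, k, sigma_k, rng):
    """Single-sample loss MSE(a0, policy(a0 + eps_k, obs, k)).

    sigma_k is a hypothetical noise scale for step k; the actual
    schedule used in the paper is not specified in this section.
    """
    eps_k = [rng.gauss(0.0, sigma_k) for _ in a0]
    noisy_action = [a + e for a, e in zip(a0, eps_k)]
    prediction = policy(noisy_action, obs, k)
    return mse(a0, prediction)


def identity_policy(noisy_action, obs, k):
    """Hypothetical stand-in for the learned network: returns its input."""
    return noisy_action


rng = random.Random(0)
z = concat_features([0.2, 0.7], [1.5])      # toy semantic + depth features
clean_action = [0.1, -0.2, 0.3]             # toy action a^0
loss = denoising_loss(identity_policy, clean_action, z, 5, 0.5, rng)
print("loss with an untrained stand-in policy:", loss)
```

A trained policy would drive this loss toward zero by recovering the clean action from the noisy one. The stand-in only does so when no noise is added.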
Abstract
Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment \textit{instances} shown in the training data, and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S$^2$-Diffusion) that generalizes from instance-level training data to category-level skills, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks so that only a single RGB camera is needed. We evaluate and compare our approach on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S$^2$-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even when it was not trained on those specific instances. Project website: https://s2-diffusion.github.io.
