S$^2$-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation

Quantao Yang, Michael C. Welle, Danica Kragic, Olov Andersson

TL;DR

Imitation-learning policies for robot manipulation often fail to generalize beyond the specific instances seen in training. This work introduces S$^2$-Diffusion, an open-vocabulary spatial-semantic diffusion policy that fuses Grounded-SAM2 semantic segmentation with DepthAnythingV2 depth prediction to form a spatial-semantic representation, $z = z_f \oplus z_d$, which concatenates semantic features $z_f$ and depth features $z_d$ and conditions a diffusion-based visuomotor policy. The method trains a CNN-based diffusion model on RGB inputs and proprioception, using a DDIM-like schedule and the loss $\mathcal{L} = \text{MSE}(a^0, \pi_\theta(a^0+\epsilon^k, o, k))$, where $a^0$ is the demonstrated action, $\epsilon^k$ is the noise added at diffusion step $k$, and $o$ is the observation. This enables generalization from instance-level demonstrations to category-level skills. Across diverse simulated and real-world tasks, S$^2$-Diffusion outperforms baselines and is robust to background, texture, and object variations while requiring only a single RGB camera.
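
The loss above can be illustrated with a minimal, dependency-free Python sketch. It follows the simplified form written in the TL;DR (noise added directly to $a^0$); the Gaussian noise scale `sigma_k`, the helper names, and the two stand-in policies are assumptions made for illustration and are not taken from the paper.

```python
import random

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    assert len(pred) == len(target)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def noisy_action(a0, sigma_k, rng):
    """Return a^0 + eps^k with eps^k ~ N(0, sigma_k^2) per dimension (assumed noise model)."""
    return [a + rng.gauss(0.0, sigma_k) for a in a0]

def denoising_loss(policy, a0, obs, k, sigma_k, rng):
    """L = MSE(a^0, pi_theta(a^0 + eps^k, o, k)) for a single training sample."""
    a_k = noisy_action(a0, sigma_k, rng)
    return mse(policy(a_k, obs, k), a0)

# Hypothetical stand-ins for pi_theta; a real policy would be a trained network.
def identity_policy(a_k, obs, k):
    return a_k  # no denoising: returns the noised action unchanged

def make_oracle_policy(a0):
    return lambda a_k, obs, k: list(a0)  # perfect denoiser for this one sample

rng = random.Random(0)
a0 = [0.1, -0.2, 0.3]   # demonstrated action
obs = [1.0, 0.5]        # placeholder observation / conditioning vector

loss_identity = denoising_loss(identity_policy, a0, obs, k=10, sigma_k=0.5, rng=rng)
loss_oracle = denoising_loss(make_oracle_policy(a0), a0, obs, k=10, sigma_k=0.5, rng=rng)
```

A policy that recovers $a^0$ exactly drives the loss to zero, while one that passes the noised action through is penalized by the mean squared noise.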

Abstract

Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment instances shown in the training data, and transfer poorly to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S$^2$-Diffusion) that generalizes from instance-level training data to the category level, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks so that only a single RGB camera is required. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S$^2$-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even when it was not trained on those specific instances. Project website: https://s2-diffusion.github.io.
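
As a rough illustration of combining a promptable semantic module with a spatial representation, the sketch below builds $z = z_f \oplus z_d$ by concatenating a flattened segmentation mask with a flattened depth map. This is a toy under stated assumptions: the function names are hypothetical, the mask and depth values are made up, and the actual method may well encode each channel with learned features before fusing them.

```python
def flatten(grid):
    """Flatten an H x W nested list into a flat list of floats."""
    return [float(v) for row in grid for v in row]

def spatial_semantic_representation(mask, depth):
    """Concatenate semantic and depth features: z = z_f (+) z_d.

    mask  : H x W binary mask for the prompted category (assumed segmenter output)
    depth : H x W depth estimate from a single RGB image (assumed depth-model output)
    """
    same_shape = len(mask) == len(depth) and all(
        len(m) == len(d) for m, d in zip(mask, depth)
    )
    if not same_shape:
        raise ValueError("mask and depth must share the same H x W shape")
    z_f = flatten(mask)   # semantic channel
    z_d = flatten(depth)  # spatial channel
    return z_f + z_d      # list concatenation plays the role of (+)

# Toy 2 x 2 inputs (made-up values for illustration only).
mask = [[0, 1], [1, 1]]
depth = [[0.9, 0.4], [0.5, 0.45]]
z = spatial_semantic_representation(mask, depth)
```

In this toy form the representation carries no raw color or texture values, only which pixels belong to the prompted category and how far away they are.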

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our Spatial-Semantic Diffusion policy (S$^2$-Diffusion) not only efficiently completes the task at hand but also enables the generalization of the same skill across diverse contexts and task variations.
  • Figure 2: Skill abstraction hierarchy for flipping and scooping tasks.
  • Figure 3: S$^2$-Diffusion Architecture. The architecture is composed of three components: a pretrained semantic segmentation model, Grounded-SAM2 [ren2024grounded]; a pretrained depth prediction model, DepthAnythingV2 [yang2024depth]; and a U-Net denoising diffusion policy [chi2023diffusionpolicy]. We design an object-aware spatial-semantic representation that conditions the denoising probabilistic model.
  • Figure 4: Simulated Tasks. We perform evaluations on six single-stage tasks from the large-scale simulation framework RoboCasa [robocasa2024]: ServeMug, CloseDoor, TurnOnMicrowave, TurnOffFaucet, MoveSoda, and TurnOnStove; and on two tasks in the SAPIEN simulator: HangMug and InsertPencil.
  • Figure 5: Comparison of S$^2$-Diffusion and the baseline in two real-world environments: whiteboard wiping and bowl-to-bowl (btb) scooping. Both methods are trained on the red-whiteboard-wiping and rice-bowl-to-bowl-scooping datasets, respectively, then evaluated on the known instances and transferred to unseen instances of the two tasks. Note that for choco-cereal-btb-scooping, hearts-cereal-btb-scooping, mixed-cereal-btb-scooping, and green-whiteboard-wiping, the baseline diffusion policy achieves a $0\%$ success rate.
  • ...and 2 more figures