SCOOP'D: Learning Mixed-Liquid-Solid Scooping via Sim2Real Generative Policy

Kuanning Wang, Yongchong Gu, Yuqian Fu, Zeyu Shangguan, Sicheng He, Xiangyang Xue, Yanwei Fu, Daniel Seita

TL;DR

Robotic scooping of mixtures containing liquids and solids is challenging due to complex tool-object interactions and deformable dynamics. The authors propose SCOOP'D, a Sim2Real framework that learns from simulation with privileged state information in OmniGibson to train two diffusion-policy models, $f_\phi$ for pre-scoop pose and $\pi_\theta$ for scooping motions, plus a geometry network $g_\psi$ and perception modules for real-time object pose estimation. A large synthetic dataset, SimScoop, contains 6,480 demonstrations, enabling zero-shot real-world deployment that generalizes across objects, liquids, occlusions, and containers, validated over hundreds of trials. This approach offers scalable, safe, and broadly applicable robotic scooping for assistive, cooking, and environmental-cleanup tasks without requiring real-world fine-tuning.

Abstract

Scooping items with tools such as spoons and ladles is common in daily life, ranging from assistive feeding to retrieving items from environmental disaster sites. However, developing a general and autonomous robotic scooping policy is challenging since it requires reasoning about complex tool-object interactions. Furthermore, scooping often involves manipulating deformable objects, such as granular media or liquids, which is challenging due to their infinite-dimensional configuration spaces and complex dynamics. We propose a method, SCOOP'D, which uses simulation from OmniGibson (built on NVIDIA Omniverse) to collect scooping demonstrations using algorithmic procedures that rely on privileged state information. Then, we use generative policies via diffusion to imitate demonstrations from observational input. We directly apply the learned policy in diverse real-world scenarios, testing its performance on various item quantities, item characteristics, and container types. In zero-shot deployment, our method demonstrates promising results across 465 trials in diverse scenarios, including objects of different difficulty levels that we categorize as "Level 1" and "Level 2." SCOOP'D outperforms all baselines and ablations, suggesting that this is a promising approach to acquiring robotic scooping skills. Project page is at https://scoopdiff.github.io/.
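The two-stage control flow summarized above (one call to $f_\phi$ for the pre-scoop pose, then repeated calls to $\pi_\theta$ during scooping) can be illustrated with a minimal sketch. The argument names follow the Figure 3 caption; everything else is an assumption made for illustration: `run_scoop_episode`, `toy_f_phi`, `toy_pi_theta`, and `toy_observe` are hypothetical names, the toy policies are simple hand-written stand-ins rather than diffusion models, $\rho$ is passed through opaquely because this page does not define it, and actions are treated as position deltas.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def run_scoop_episode(
    f_phi: Callable,      # stand-in for the pre-scoop pose generator
    pi_theta: Callable,   # stand-in for the closed-loop scooping policy
    observe: Callable,    # hypothetical perception hook: step -> (p_target, v_target)
    rho,                  # input to f_phi; not defined on this page, passed through as-is
    r_target: float,      # target item radius
    max_steps: int = 50,
) -> Tuple[Vec3, Vec3, List[Vec3]]:
    """Stage 1: call f_phi once. Stage 2: query pi_theta at every step."""
    p_pre_scoop = f_phi(rho, r_target)  # executed only once per episode
    ladle = p_pre_scoop                 # assumption: the ladle reaches this pose exactly
    actions: List[Vec3] = []
    for t in range(max_steps):
        p_target, v_target = observe(t)
        # Assumption: p_relative is the target position relative to the ladle.
        p_relative = tuple(pt - pl for pt, pl in zip(p_target, ladle))
        a_t = pi_theta(p_relative, v_target, p_pre_scoop, r_target)
        # Assumption: a_t is a Cartesian position delta applied to the ladle.
        ladle = tuple(pl + a for pl, a in zip(ladle, a_t))
        actions.append(a_t)
    return p_pre_scoop, ladle, actions


# Toy stand-ins (NOT the learned models) so the loop can be exercised.
def toy_f_phi(rho, r_target: float) -> Vec3:
    return (0.0, 0.0, 0.1 + r_target)


def toy_pi_theta(p_relative, v_target, p_pre_scoop, r_target) -> Vec3:
    # Proportional step toward the target; ignores velocity for simplicity.
    return tuple(0.5 * d for d in p_relative)


def toy_observe(t: int):
    # Static target at a fixed position with zero velocity.
    return (0.2, 0.0, 0.0), (0.0, 0.0, 0.0)


if __name__ == "__main__":
    p_pre, final_pose, acts = run_scoop_episode(
        toy_f_phi, toy_pi_theta, toy_observe, rho=None, r_target=0.02
    )
    print(p_pre, final_pose, len(acts))
```

The sketch only mirrors the structure described here: the pose generator runs once, while the scooping policy is re-queried with fresh target state at each step, which is what allows it to react to a moving or perturbed target.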

Paper Structure

This paper contains 37 sections, 1 equation, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Sim2Real Demonstrations and Generalization. Our method, SCOOP'D, is trained entirely in simulation (left) and generalizes to diverse real-world scenarios (middle) and robust conditions (right). The left column shows simulation demonstrations and testing examples. The middle column (I–VI) presents real-world scenes with various objects and environments, where yellow circles denote the target objects. The right column demonstrates the robustness of our learned policy under different disturbances such as human perturbations, lighting changes, and different camera viewpoints.
  • Figure 2: Heuristic scooping strategy. We sketch a ladle and a target item (circle). The ladle's center of rotation is at the bottom of its "bowl." It follows the dotted circular arc to go underneath the item, and then lifts up.
  • Figure 3: Our SCOOP'D Method. The first row shows the heuristic demonstration collection. Using OmniGibson simulation, we leverage an algorithmic demonstrator for SimScoop dataset collection. The second row shows how deployment works. The left part shows how we obtain the state of the target item from text ("meatball"), detection, live video stream segmentation, and regression with the partial point cloud. The middle part shows the pipeline of our method. We use $f_\phi$ to generate a pre-scoop pose based on $\rho$ and $r_{\text{target}}$, then move the ladle directly to the generated pose. Then we leverage $\pi_\theta$ for closed-loop scooping. Our $\pi_\theta$ takes in $\mathbf{p}_\text{relative}$, $\mathbf{v}_\text{target}$, $\mathbf{p}_\text{pre-scoop}$ and $r_\text{target}$, and outputs $\mathbf{a}_t$; $f_\phi$ is executed only once. The right part shows the execution. We demonstrate the execution process in both the top and bottom containers, with the states specifically marked in the bottom for extra clarity.
  • Figure 4: Real-world experimental setup. Left: a third-person view of the setup. A third-person RealSense D435 camera captures RGBD image observations. Middle: we show different ladles, containers, and objects that we use during physical experiments. The robot shown above (to the left) is holding the smallest ladle and operating on the small container (shown in the upper right corner). Right: we show "Level 1" and "Level 2" (i.e., more challenging) objects that we use in our scooping experiments. See Sec. \ref{ssec:exp_setup} for more details.
  • Figure 5: Visualization of Scooping. We show our learned policy scooping targets in clutter. We show when it reaches the pre-scoop pose (I), when it moves towards the target (II, III, IV), and finally when it successfully scoops it (V). The first row shows the meatball in a mildly occluded real-world scene, the second row depicts a yellow cube under heavy occlusion in the real world, and the third row presents a green apple in a severely occluded scene in simulation.
  • ...and 4 more figures
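
The heuristic in the Figure 2 caption (follow a circular arc underneath the item, then lift) can be sketched as a waypoint generator. This is an illustrative reading of the caption, not the authors' demonstrator: the function name `scoop_arc_waypoints`, the 2D (x, z) side view, the choice of the item center as the arc center, the 180 to 270 degree sweep, and all default values are assumptions.

```python
import math
from typing import List, Tuple


def scoop_arc_waypoints(
    center: Tuple[float, float],  # assumed arc center (x, z), e.g. the item center
    radius: float,                # assumed arc radius in meters
    start_deg: float = 180.0,     # start level with the item, on its left side
    end_deg: float = 270.0,       # end directly underneath the item
    steps: int = 9,               # number of waypoints on the arc
    lift: float = 0.05,           # vertical lift after the arc, in meters
) -> List[Tuple[float, float]]:
    """Waypoints for the ladle's bowl-bottom reference point in a side view."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    cx, cz = center
    waypoints = []
    for i in range(steps):
        angle = math.radians(start_deg + (end_deg - start_deg) * i / (steps - 1))
        waypoints.append((cx + radius * math.cos(angle), cz + radius * math.sin(angle)))
    # After passing underneath the item, lift straight up.
    x_end, z_end = waypoints[-1]
    waypoints.append((x_end, z_end + lift))
    return waypoints


if __name__ == "__main__":
    for x, z in scoop_arc_waypoints(center=(0.0, 0.0), radius=0.1, steps=5):
        print(f"x={x:+.3f}  z={z:+.3f}")
```

In simulation, a demonstrator of this kind can read the item position directly from privileged state, which is why a simple geometric rule is enough to generate demonstrations there.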