Sim-to-Real Dynamic Object Manipulation on Conveyor Systems via Optimization Path Shaping

Zhuoling Li, Jinrong Yang, Yong Zhao, Liangliang Ren, Xiaoyang Wu, Zhenhua Xu, Hengshuang Zhao

TL;DR

The paper tackles generalizable dynamic object manipulation on conveyors by proposing GEM, a geometry-focused imitation-learning policy that prioritizes 3D structure over visual appearance to bridge the sim-to-real gap. GEM uses appearance-noise annealing to shape the optimization trajectory, guiding the network toward geometry-dominated representations, and employs a decomposition of manipulation actions into tracking and interaction components so it can handle objects moving at unseen speeds. The system is trained in Isaac Gym with diverse 3D geometries and four manipulation skills, then evaluated across in-domain/out-of-domain simulations and real-world settings, including a seven-day canteen deployment achieving 97.2% success over 10,000 operations. The work demonstrates strong generalization across backgrounds, motion patterns, unseen objects, and robot embodiments, offering a practical solution for industrial automation with minimal real-world data collection. Overall, GEM advances sim-to-real dynamic manipulation by leveraging geometry-centric representations, a probabilistic action head, memory, and action decomposition to achieve robust, scalable performance in real-world manufacturing contexts.
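The decomposition into tracking and interaction components can be illustrated with a minimal sketch. It assumes, following the Figure 5 caption, that the tracking component keeps the end-effector on the moving target and that the interaction component only takes effect once tracking is stable. The proportional controller, the `gain`, `max_step`, and `stable_tol` parameters, and the additive combination are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def tracking_action(ee_pos, target_pos, gain=1.0, max_step=0.05):
    """Translation step toward the target (hypothetical proportional controller)."""
    step = gain * (np.asarray(target_pos, float) - np.asarray(ee_pos, float))
    norm = np.linalg.norm(step)
    if norm > max_step:
        # Clip the step length so a single command stays bounded.
        step = step * (max_step / norm)
    return step

def combine_actions(ee_pos, target_pos, interaction_delta, stable_tol=0.01):
    """Add the interaction motion to the tracking motion once tracking is stable.

    stable_tol is an assumed distance threshold for "stable tracking".
    Returns the combined translation command and the stability flag.
    """
    track = tracking_action(ee_pos, target_pos)
    error = np.linalg.norm(np.asarray(target_pos, float) - np.asarray(ee_pos, float))
    stable = bool(error < stable_tol)
    interact = np.asarray(interaction_delta, float) if stable else np.zeros(3)
    return track + interact, stable

# Far from the target: only the (clipped) tracking step is issued.
cmd_far, stable_far = combine_actions([0.0, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, -0.02])
# Close to the target: the interaction step is added on top of tracking.
cmd_near, stable_near = combine_actions([0.2, 0.0, 0.0], [0.205, 0.0, 0.0], [0.0, 0.0, -0.02])
```

Because the tracking term is recomputed from the current target position at every control step, the interaction motion is expressed relative to the moving object. That is one plausible reading of why such a decomposition helps with unseen conveyor speeds.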

Abstract

Realizing generalizable dynamic object manipulation on conveyor systems is important for enhancing manufacturing efficiency, as it eliminates specialized engineering for different scenarios. To this end, imitation learning emerges as a promising paradigm, leveraging expert demonstrations to teach a policy manipulation skills. Although the generalization of an imitation learning policy can be improved by increasing demonstrations, demonstration collection is labor-intensive. Besides, public dynamic object manipulation data is scarce. In this work, we address this data scarcity problem via generating demonstrations in a simulator. A significant challenge of using simulated data lies in the appearance gap between simulated and real-world observations. To tackle this challenge, we propose Geometry-Enhanced Model (GEM), which employs our designed appearance noise annealing strategy to shape the policy optimization path, thereby prioritizing the geometry information in observations. Extensive experiments in simulated and real-world tasks demonstrate that GEM can generalize across environment backgrounds, robot embodiments, motion dynamics, and object geometries. Notably, GEM is deployed in a real canteen for tableware collection. Without test-scene data, GEM achieves a success rate of over 97% across more than 10,000 operations.
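A minimal sketch of the appearance noise annealing idea follows. It assumes the observation is an `(N, 6)` array of xyz coordinates plus RGB values in `[0, 1]`, and that the noise ratio starts at 1.0 and is removed as training progresses, as the Figure 4 caption describes. The linear schedule, the `anneal_fraction` parameter, and the blending with uniform random colors are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def noise_ratio(step, total_steps, anneal_fraction=0.5):
    """Noise ratio: 1.0 at step 0, decaying linearly to 0.0 (schedule shape is an assumption)."""
    anneal_steps = max(1, int(total_steps * anneal_fraction))
    return max(0.0, 1.0 - step / anneal_steps)

def perturb_colors(points, ratio, rng):
    """Blend the RGB channels with random colors; the xyz geometry is left untouched.

    points: (N, 6) array, columns 0-2 are xyz and columns 3-5 are rgb in [0, 1].
    ratio:  1.0 replaces colors with pure noise, 0.0 keeps the original colors.
    """
    out = points.copy()
    noise = rng.uniform(0.0, 1.0, size=(points.shape[0], 3))
    out[:, 3:6] = (1.0 - ratio) * points[:, 3:6] + ratio * noise
    return out

rng = np.random.default_rng(0)
cloud = rng.uniform(0.0, 1.0, size=(1024, 6))

# Early in training the colors carry no usable signal, so only geometry can be learned from.
early = perturb_colors(cloud, noise_ratio(0, 1000), rng)
# Late in training the original colors are passed through unchanged.
late = perturb_colors(cloud, noise_ratio(900, 1000), rng)
```

Under this sketch the network first fits the task with geometry alone and only later sees real colors, which matches the stated intent of shaping the optimization path toward geometry information.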

Paper Structure

This paper contains 41 sections, 9 equations, 17 figures, 13 tables.

Figures (17)

  • Figure 1: Primarily using demonstrations collected from a simulator, our method can generalize across diverse environment backgrounds, robot embodiments, motion dynamics, and object geometries. Our method has been reliably deployed in a real canteen to conduct tableware collection. Without using demonstrations collected in this canteen, our method achieves a success rate of over 97% in seven consecutive days of operation, performing more than 10,000 tableware collection operations.
  • Figure 2: The 3D geometry of objects usually plays a more important role than 2D appearance in scheduling manipulation actions. As shown, although the left and middle objects share similar color and texture, the appropriate gripper poses to pick them up are different. By contrast, though the middle and right objects belong to different object categories with distinct appearances, they can be picked up with a similar gripper pose.
  • Figure 3: The overall framework of GEM. The policy observation is the colored 3D points captured by RGB-D cameras. The appearance noise annealing strategy is utilized to perturb the color feature in policy observation. Based on this observation, the GEM network predicts a multi-modal action distribution represented as a Gaussian Mixture Model (GMM), a point segmentation mask of the target object to manipulate, and a status flag marking the completion of manipulating one object. We generate interaction actions and tracking actions from the network outputs and combine them to control the robot end-effector.
  • Figure 4: In the appearance noise annealing strategy, strong color perturbation (noise ratio is 1.0) is applied to the input point cloud at the initial phase of training. As the training progresses, the perturbation is gradually removed.
  • Figure 5: How the predicted tracking actions and interaction actions are integrated to perform dynamic object manipulation. Tracking actions are responsible for approaching and tracking the moving target object. Interaction actions take effect once stable tracking is achieved and handle collision-rich contact with the target object, such as picking and rotating.
  • ...and 12 more figures
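
The Figure 3 caption states that the network predicts a multi-modal action distribution represented as a Gaussian Mixture Model (GMM). The sketch below shows one common way to draw an action from such a head. It assumes diagonal covariances and that the mixture weights, means, and standard deviations are already available as arrays; the function names and the parameterization are hypothetical, since the source does not describe the head in detail.

```python
import numpy as np

def sample_gmm_action(weights, means, stds, rng):
    """Draw one action from a diagonal-covariance GMM (parameterization is an assumption).

    weights: (K,) non-negative mixture weights.
    means:   (K, D) component means.
    stds:    (K, D) per-dimension standard deviations.
    Returns the sampled action and the index of the chosen component.
    """
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()  # normalize so the weights form a distribution
    k = int(rng.choice(len(weights), p=weights))
    action = means[k] + stds[k] * rng.standard_normal(means.shape[1])
    return action, k

def most_likely_mode(weights, means):
    """Deterministic alternative: return the mean of the highest-weight component."""
    return means[int(np.argmax(weights))]

rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.1], [0.05, 0.0, -0.1]])
stds = np.zeros_like(means)  # zero spread makes the sample equal to the chosen mean
action, k = sample_gmm_action([0.0, 1.0], means, stds, rng)
mode = most_likely_mode([0.3, 0.7], means)
```

A mixture head can represent several distinct valid actions for the same observation, such as different grasp approaches, without averaging them into a single invalid one.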