Dynamic Non-Prehensile Object Transport via Model-Predictive Reinforcement Learning

Neel Jawale, Byron Boots, Balakumar Sundaralingam, Mohak Bhardwaj

TL;DR

This work tackles dynamic non-prehensile object transport (the 'robot waiter' task) by learning from a small set of task-space demonstrations. It introduces Conservative Value MPC (CV-MPC), which trains an ensemble of end-effector value functions offline from demonstrations and, online, uses a pessimistic trajectory-return estimate within a GPU-accelerated MPC to keep planning safe and robust despite limited data. The approach generalizes to unseen objects and can improve on suboptimal demonstrations, achieving strong real-world performance with only 50–100 demonstrations. By integrating offline value learning with online MPC, CV-MPC reduces demonstrator burden and enables rapid, robust learning of dynamic manipulation tasks that rely on contact dynamics and friction constraints. The method complements existing MPC frameworks and opens avenues for applying offline-to-online learning to a broader class of dynamic non-prehensile actions.

Abstract

We investigate the problem of teaching a robot manipulator to perform dynamic non-prehensile object transport, also known as the 'robot waiter' task, from a limited set of real-world demonstrations. We propose an approach that combines batch reinforcement learning (RL) with model-predictive control (MPC) by pretraining an ensemble of value functions from demonstration data, and utilizing them online within an uncertainty-aware MPC scheme to ensure robustness to limited data coverage. Our approach is straightforward to integrate with off-the-shelf MPC frameworks and enables learning solely from task-space demonstrations with sparsely labeled transitions, while leveraging MPC to ensure smooth joint-space motions and constraint satisfaction. We validate the proposed approach through extensive simulated and real-world experiments on a Franka Panda robot performing the robot waiter task and demonstrate robust deployment of value functions learned from 50–100 demonstrations. Furthermore, our approach enables generalization to novel objects not seen during training and can improve upon suboptimal demonstrations. We believe that such a framework can reduce the burden of providing extensive demonstrations and facilitate rapid training of robot manipulators to perform non-prehensile manipulation tasks. Project videos and supplementary material can be found at: https://sites.google.com/view/cvmpc.
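
To make the idea of an uncertainty-aware value estimate concrete, the following is a minimal sketch, not the authors' implementation. It assumes one common form of pessimism, the ensemble mean minus $\lambda$ times the ensemble standard deviation, and a reward-maximizing convention. The exact estimator used by CV-MPC is not given in this summary. The names `pessimistic_value`, `pessimistic_return`, and `select_rollout`, the discount `gamma`, and the toy value functions are all hypothetical, and the sketch runs on the CPU in plain Python rather than in a GPU-accelerated MPC.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]
ValueFn = Callable[[State], float]
# A candidate rollout: per-step rewards and the state reached at the horizon.
Rollout = Tuple[Sequence[float], State]


def pessimistic_value(ensemble: Sequence[ValueFn], state: State, lam: float) -> float:
    """Assumed pessimism rule: ensemble mean minus lam times ensemble std."""
    preds = [v(state) for v in ensemble]
    return mean(preds) - lam * pstdev(preds)


def pessimistic_return(rewards: Sequence[float], terminal_state: State,
                       ensemble: Sequence[ValueFn], lam: float,
                       gamma: float = 0.99) -> float:
    """Discounted rewards along a rollout plus a pessimistic terminal value."""
    running = sum(gamma ** t * r for t, r in enumerate(rewards))
    horizon = len(rewards)
    return running + gamma ** horizon * pessimistic_value(ensemble, terminal_state, lam)


def select_rollout(rollouts: Sequence[Rollout], ensemble: Sequence[ValueFn],
                   lam: float, gamma: float = 0.99) -> int:
    """Index of the candidate rollout with the highest pessimistic return."""
    scores: List[float] = [
        pessimistic_return(rewards, state, ensemble, lam, gamma)
        for rewards, state in rollouts
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy ensemble (hypothetical): members agree when state[1] == 0 and
# disagree more as |state[1]| grows, mimicking poor data coverage.
toy_ensemble: List[ValueFn] = [
    lambda s: s[0] + s[1],
    lambda s: s[0] - s[1],
]

# Two candidates with zero running reward; B ends in a state where the
# ensemble mean is higher but the members disagree strongly.
candidate_a: Rollout = ([0.0, 0.0], (1.0, 0.0))
candidate_b: Rollout = ([0.0, 0.0], (2.0, 3.0))

optimistic_choice = select_rollout([candidate_a, candidate_b], toy_ensemble, lam=0.0, gamma=1.0)
pessimistic_choice = select_rollout([candidate_a, candidate_b], toy_ensemble, lam=1.0, gamma=1.0)
```

With `lam=0.0` the planner trusts the ensemble mean and picks candidate B. With `lam=1.0` the disagreement penalty outweighs B's higher mean and the planner falls back to candidate A, whose terminal state the ensemble agrees on. This is the qualitative behavior that a pessimism weight $\lambda$ and ensemble size $K$ are meant to control.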

Paper Structure

This paper contains 16 sections, 10 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 2: (Simulation Experiment) Comparison of success rates across ensemble size $K$ and pessimism $\lambda$ for (a) Value Functions and (b) Learned One-Step Cost. For Value Functions, $\lambda = 20$ and $K = 80$ yield the best performance, though results are robust across a range of parameter combinations, reducing the need for extensive fine-tuning. In contrast, the Learned One-Step Cost exhibits lower overall success, with relatively high performance limited to $\lambda = 1$, highlighting its difficulty in handling sparse rewards.
  • Figure 3: (Simulation Experiment) Bar plots comparing performance metrics, evaluated across 60 trials per algorithm, for our proposed CV-MPC approach and several baselines: MPC$_{\text{friction}}$, a demonstrator with accurate knowledge of object properties; MPC$_{\text{biased}}$, a biased demonstrator assuming incorrect object properties (friction coefficient $\mu$); MPC$_{\text{orientation}}$, which employs a high orientation-maintaining cost; and MPC$_{\text{learned\_cost}}$, which learns the one-step friction cost from demonstrations instead of utilizing a value function. The CV-MPC approach performs comparably to the demonstrator with true object properties and surpasses the other baselines in terms of either success or dynamic behavior.
  • Figure 4: (Simulation Experiment) Success rates of a demonstrator that assumes a biased friction coefficient $\mu$ versus CV-MPC, evaluated across 20 trials per $\mu$. The expert assumes $\mu = 0.6$, which is higher than the true values shown in the different columns. By learning value functions across different settings, CV-MPC can improve over the suboptimal demonstrator.
  • Figure 5: Solid 3D-printed objects with a smooth finish are used for tray-object transport experiments. In Case Study 1, only a cube is used for collecting demonstrations, while other convex objects with varied inertial properties are used for testing. In Case Study 2, the five poorest-performing objects from Case Study 1 are used for training, and five household objects with diverse materials and inertial properties are used for testing.
  • Figure 6: (Real-World Experiment) Success rates of CV-MPC on household objects of varied properties reported across 80 total trials per object. After training on 100 demonstrations from the five lowest-performing objects in Case Study 1, CV-MPC achieved high success rates on objects with entirely different shapes and materials, demonstrating its adaptability and efficacy.
  • ...and 4 more figures