Constrained Sampling to Guide Universal Manipulation RL

Marc Toussaint, Cornelius V. Braun, Eckart Cobo-Briesewitz, Sayantan Auddy, Armand Jordana, Justin Carpentier

TL;DR

This work addresses the challenge of training universal, goal-conditioned manipulation policies in contact-rich settings by injecting structure from model-based constraints into RL. It formalizes a Constrained Goal-conditioned MDP (CG-MDP) where the start/goal distribution $p_0(s,g)$ is defined by differentiable collision, contact, and force constraints, enabling constraint-guided RL without relying on full dynamics models. The authors develop Sample-Guided RL, combining constrained state sampling, zero-order open-loop trajectory optimization, and optional behavior cloning to bias state visitation and accelerate learning. Across minimalistic (double-sphere) and complex (panda arm) domains, constraint-guided sampling and scheduling strategies yield markedly higher final policy performance than baselines, highlighting the value of sampling-based priors over imitation alone for universal manipulation. The results demonstrate that leveraging physical feasibility and interpolating feasible states can produce robust, reactive policies that generalize to diverse goals and starting configurations in manipulation tasks.

Abstract

We consider how model-based solvers can be leveraged to guide training of a universal policy to control from any feasible start state to any feasible goal in a contact-rich manipulation setting. While Reinforcement Learning (RL) has demonstrated its strength in such settings, it may struggle to sufficiently explore and discover complex manipulation strategies, especially in sparse-reward settings. Our approach is based on the idea that the feasible, likely-visited states during such manipulation lie on a lower-dimensional manifold, and that RL can be guided with a sampler from this manifold. We propose Sample-Guided RL, which uses model-based constraint solvers to efficiently sample feasible configurations (satisfying differentiable collision, contact, and force constraints) and leverages them to guide RL for universal (goal-conditioned) manipulation policies. We study using this data directly to bias state visitation, as well as using black-box optimization of open-loop trajectories between random configurations to impose a state bias and optionally add a behavior cloning loss. In a minimalistic double-sphere manipulation setting, Sample-Guided RL discovers complex manipulation strategies and achieves high success rates in reaching any statically stable state. In a more challenging panda arm setting, our approach achieves a significant success rate over a near-zero baseline, and demonstrates a breadth of complex whole-body-contact manipulation strategies.
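
To make the idea of sampling from a constraint-defined set concrete, here is a minimal sketch, not the authors' solver. It assumes a toy 2D scene with two equal spheres above a floor; the radius `R`, the three constraints in `residual`, and the Gauss-Newton projection with a finite-difference Jacobian are all illustrative assumptions. Random seeds are projected onto the feasible set until every constraint residual vanishes.

```python
import numpy as np

R = 0.1  # hypothetical sphere radius (assumption, not from the paper)

def residual(x):
    """Toy constraint residuals for x = (p1x, p1y, p2x, p2y); all zero means feasible."""
    p1, p2 = x[:2], x[2:]
    return np.array([
        p1[1] - R,                        # contact: sphere 1 rests on the floor
        np.linalg.norm(p1 - p2) - 2 * R,  # contact: the two spheres touch
        min(0.0, p2[1] - R),              # collision: sphere 2 must not penetrate the floor
    ])

def jacobian(x, eps=1e-6):
    """Forward finite-difference Jacobian of the residuals."""
    r0 = residual(x)
    J = np.zeros((r0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (residual(x + dx) - r0) / eps
    return J

def sample_feasible(rng, iters=50, tol=1e-6):
    """Project one random seed onto the feasible set with Gauss-Newton steps."""
    x = rng.uniform(-1.0, 1.0, size=4)
    for _ in range(iters):
        r = residual(x)
        if np.linalg.norm(r) < tol:
            return x
        # Minimum-norm step that solves the linearized constraints J dx = r.
        x = x - np.linalg.lstsq(jacobian(x), r, rcond=None)[0]
    return None  # projection did not converge; the caller resamples

def sample_many(n, seed=0):
    """Collect n feasible configurations, discarding failed projections."""
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n:
        x = sample_feasible(rng)
        if x is not None:
            samples.append(x)
    return np.array(samples)
```

Calling `sample_many(20)` returns an array of shape `(20, 4)` whose rows satisfy the toy constraints up to the tolerance. In a guided-RL setup, such samples would play the role of start and goal states; how they are actually used is described in the paper itself.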

Paper Structure (23 sections, 17 equations, 6 figures, 1 algorithm)

Figures (6)

  • Figure 1: Random samples from a model-based constrained space that defines the start and goal state distribution of a Constrained Goal-conditioned MDP (CG-MDP).
  • Figure 2: Random samples from the double sphere domain.
  • Figure 3: Compute metrics for (a) constrained state sampling and (b) zero-order trajectory optimization.
  • Figure 4: Optimization runs with median (shading: 20/80% quantiles) over 20 runs, for double sphere and panda sphere domains.
  • Figure 5: Average episode reward (which is identical to success rate) during RL training. Mean and standard deviation (shading) over 5 independent runs for each method. Note that rewards depend on the current start/goal distribution during training. In scheduled approaches this distribution changes every 100k steps, and starts are sampled closer to goals in earlier phases, which explains the decreasing average rewards and the pronounced steps every 100k steps in such methods.
  • ...and 1 more figure
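
The Figure 5 caption mentions scheduled start/goal distributions that change every 100k steps, with starts sampled closer to goals in earlier phases. The sketch below illustrates one way such a schedule could look. It is an assumption-laden illustration, not the paper's procedure: the number of phases `N_PHASES`, the linear ramp in `start_fraction`, and the use of a reference trajectory whose last row is the goal are all hypothetical.

```python
import numpy as np

PHASE_LEN = 100_000  # steps per phase, as stated in the Figure 5 caption
N_PHASES = 5         # hypothetical number of phases (assumption)

def start_fraction(step):
    """Fraction of the start-to-goal path in play at this training step (assumed linear ramp)."""
    phase = min(step // PHASE_LEN, N_PHASES - 1)
    return (phase + 1) / N_PHASES

def sample_start(trajectory, step, rng):
    """Pick a start state from a reference trajectory of shape [T, dim] whose last row is the goal."""
    T = len(trajectory)
    # Early phases only allow starts a few waypoints before the goal.
    max_back = max(1, int(round(start_fraction(step) * (T - 1))))
    back = rng.integers(1, max_back + 1)  # number of waypoints before the goal
    return trajectory[T - 1 - back]
```

With an 11-waypoint trajectory, starts at step 0 come from the last two waypoints before the goal, while from step 400,000 onward any waypoint other than the goal can be drawn. Because the task gets harder at each phase boundary, average reward would be expected to drop in steps, which matches the pattern the caption describes.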