Constrained Sampling to Guide Universal Manipulation RL
Marc Toussaint, Cornelius V. Braun, Eckart Cobo-Briesewitz, Sayantan Auddy, Armand Jordana, Justin Carpentier
TL;DR
This work addresses the challenge of training universal, goal-conditioned manipulation policies in contact-rich settings by injecting structure from model-based constraints into RL. It formalizes a Constrained Goal-conditioned MDP (CG-MDP) in which the start/goal distribution $p_0(s,g)$ is defined by differentiable collision, contact, and force constraints, enabling constraint-guided RL without relying on a full dynamics model. The authors develop Sample-Guided RL, which combines constrained state sampling, zero-order open-loop trajectory optimization, and an optional behavior cloning loss to bias state visitation and accelerate learning. In both a minimalistic double-sphere domain and a more complex Panda arm domain, constraint-guided sampling and scheduling strategies yield markedly higher final policy performance than the baselines, which highlights the value of sampling-based priors over imitation alone for universal manipulation. The results indicate that leveraging physical feasibility and interpolating between feasible states can produce robust, reactive policies that generalize to diverse goals and starting configurations in manipulation tasks.
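To make the idea of a constraint-defined start/goal distribution concrete, the following is a minimal sketch of constrained sampling, not the authors' solver. It assumes a hypothetical 2D toy with two discs in a box: a random seed configuration is projected onto the feasible set (wall, non-penetration, and optional contact constraints) by Gauss-Newton steps on the constraint violations. The radii, the box size, the finite-difference Jacobian, and all function names are illustrative assumptions.

```python
import numpy as np

R1, R2 = 0.2, 0.3   # disc radii (hypothetical values)
BOX = 1.0           # both discs must stay inside [-BOX, BOX]^2


def violations(x, in_contact):
    """Stack constraint violations; an all-zero vector means x is feasible."""
    p1, p2 = x[:2], x[2:]
    gap = np.linalg.norm(p1 - p2) - (R1 + R2)
    # Wall inequalities g(x) <= 0, kept only where they are violated.
    walls = np.concatenate([np.abs(p1) + R1 - BOX, np.abs(p2) + R2 - BOX])
    res = list(np.maximum(walls, 0.0))
    if in_contact:
        res.append(gap)             # equality: the discs touch
    else:
        res.append(min(gap, 0.0))   # inequality: the discs do not penetrate
    return np.array(res)


def jacobian(f, x, eps=1e-6):
    """Forward finite-difference Jacobian of f at x."""
    f0 = f(x)
    J = np.zeros((f0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - f0) / eps
    return J


def sample_feasible(rng, in_contact, max_iters=50, tol=1e-6):
    """Project a random seed onto the constraint set; None if not converged."""
    x = rng.uniform(-BOX, BOX, size=4)
    f = lambda y: violations(y, in_contact)
    for _ in range(max_iters):
        r = f(x)
        if np.max(np.abs(r)) < tol:
            return x
        # Minimum-norm Gauss-Newton step that cancels the linearized violation.
        step, *_ = np.linalg.lstsq(jacobian(f, x), -r, rcond=None)
        x = x + step
    return None


def sample_batch(n, in_contact, seed=0):
    """Collect n feasible configurations, rejecting seeds that fail to converge."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x = sample_feasible(rng, in_contact)
        if x is not None:
            out.append(x)
    return np.array(out)
```

Under these assumptions, `sample_batch(10, in_contact=True)` returns ten configurations in which the two discs touch and stay inside the box; pairs of such samples could then play the role of start and goal states drawn from $p_0(s,g)$.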
Abstract
We consider how model-based solvers can be leveraged to guide training of a universal policy to control from any feasible start state to any feasible goal in a contact-rich manipulation setting. While Reinforcement Learning (RL) has demonstrated its strength in such settings, it may struggle to sufficiently explore and discover complex manipulation strategies, especially in sparse-reward settings. Our approach is based on the idea that the feasible, likely-visited states during such manipulation lie on a lower-dimensional manifold, and that RL can be guided with a sampler from this manifold. We propose Sample-Guided RL, which uses model-based constraint solvers to efficiently sample feasible configurations (satisfying differentiable collision, contact, and force constraints) and leverages them to guide RL for universal (goal-conditioned) manipulation policies. We study using this data directly to bias state visitation, as well as using black-box optimization of open-loop trajectories between random configurations to impose a state bias and optionally add a behavior cloning loss. In a minimalistic double-sphere manipulation setting, Sample-Guided RL discovers complex manipulation strategies and achieves high success rates in reaching any statically stable state. In a more challenging Panda arm setting, our approach achieves a significant success rate over a near-zero baseline and demonstrates a breadth of complex whole-body-contact manipulation strategies.
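The black-box optimization of open-loop trajectories mentioned above can be illustrated with a small sketch. This is an assumption-laden stand-in rather than the paper's optimizer: the simulator is replaced by a hypothetical 2D point driven by clipped velocity commands, and the zero-order method is a plain (1+λ) random search that halves its step size whenever no candidate improves. All names and parameter values are illustrative.

```python
import numpy as np


def rollout(start, actions, dt=0.1):
    """Stand-in black-box simulator: a 2D point moved by clipped velocities."""
    x = np.array(start, dtype=float)
    for u in actions:
        x = x + dt * np.clip(u, -1.0, 1.0)
    return x


def optimize_open_loop(start, goal, horizon=20, pop=32, iters=100, seed=0):
    """Zero-order search for an action sequence that ends near the goal."""
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, dtype=float)

    def cost(actions):
        # Only rollouts are queried; no gradients of the simulator are used.
        return np.linalg.norm(rollout(start, actions) - goal)

    best = np.zeros((horizon, 2))
    best_cost, sigma = cost(best), 0.5
    for _ in range(iters):
        # Perturb the incumbent action sequence with Gaussian noise.
        cands = best + sigma * rng.standard_normal((pop, horizon, 2))
        costs = np.array([cost(a) for a in cands])
        i = int(np.argmin(costs))
        if costs[i] < best_cost:
            best, best_cost = cands[i], costs[i]
        else:
            sigma *= 0.5   # no improvement: search more locally
    return best, best_cost
```

For example, `optimize_open_loop((0.0, 0.0), (0.8, -0.5))` returns an action sequence and its final distance to the goal. In the spirit of the abstract, the states visited along such a rollout could serve to bias state visitation, and the state-action pairs could feed an optional behavior cloning loss.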
