From Prior to Pro: Efficient Skill Mastery via Distribution Contractive RL Finetuning

Zhanyi Sun, Shuran Song

TL;DR

DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback, and enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot.

Abstract

We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion- or flow-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency. It enables mastery of complex long-horizon manipulation skills directly from high-dimensional pixel inputs, both in simulation and on a real robot. Project website: https://zhanyisun.github.io/dice.rl.2026/.
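
The abstract does not spell out how value-guided action selection works. One common reading is best-of-N sampling: draw several candidate actions from the behavior prior and keep the one a critic scores highest. The toy sketch below illustrates that reading only and is not the authors' implementation. The names `sample_prior`, `toy_q`, and `select_action`, the two-mode 1-D prior, and the quadratic critic are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def sample_prior(rng, n):
    """Hypothetical stand-in for a pretrained generative policy:
    a 1-D action prior with two modes, near -1.0 and +1.0."""
    modes = rng.choice([-1.0, 1.0], size=n)
    return modes + 0.1 * rng.standard_normal(n)

def toy_q(actions):
    """Hypothetical stand-in for a learned critic: value peaks at action = 1.0."""
    return -(actions - 1.0) ** 2

def select_action(rng, n_candidates=16):
    """Best-of-N selection: sample candidates from the prior,
    then return the candidate with the highest critic value."""
    candidates = sample_prior(rng, n_candidates)
    return candidates[np.argmax(toy_q(candidates))]

rng = np.random.default_rng(0)

# Actions drawn directly from the prior cover both modes.
prior_actions = sample_prior(rng, 2000)

# Actions chosen by best-of-N concentrate on the high-value mode.
selected_actions = np.array([select_action(rng) for _ in range(2000)])

prior_std = prior_actions.std()
selected_std = selected_actions.std()
print(f"prior std: {prior_std:.3f}, selected std: {selected_std:.3f}")
```

In this toy setting the selected actions cluster around the high-value mode while the raw prior samples stay spread over both modes, which is the sense in which selection "contracts" the action distribution.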

Paper Structure

This paper contains 17 sections, 13 equations, 19 figures, 3 tables, and 1 algorithm.

Figures (19)

  • Figure 1: Distribution Contractive RL (DICE-RL) refines (a) a pretrained generative behavior cloning (BC) policy (i.e., the behavior prior) into (b) a "pro" policy by contracting the action distribution around successful action modes. DICE-RL leverages the generative BC policy to achieve (c) controllable exploration for sample-efficient and stable reinforcement learning.
  • Figure 2: Comparisons on Robomimic. Success rate versus online environment steps for Robomimic tasks under RL finetuning. Top row: state observations; bottom row: pixel observations. Curves are averaged over 5 random seeds, with evaluation on 300 held-out test configurations; shaded regions denote variability across seeds.
  • Figure 3: DICE-RL using either Proficient-Human (PH) or Multi-Human (MH) data. Top: success-rate curves. Bottom: finetunability metrics (GoodCov/BadCov/BadEnt).
  • Figure 4: DICE-RL finetuning with pretrained BC checkpoints trained for different numbers of epochs. Left: success-rate curves. Right: finetunability metrics (GoodCov/BadCov/BadEnt).
  • Figure 5: Value improvement ($\Delta V$) vs. action entropy reduction ($\Delta H$). Larger gains in value are accompanied by larger drops in action entropy, indicating that RL sharpens the pretrained action distribution toward high-value actions.
  • ...and 14 more figures