Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows

Chenyu Yang; Denis Tarasov; Davide Liconti; Hehui Zheng; Robert K. Katzschmann

TL;DR

SOFT-FLOW targets sample-efficient, real-world fine-tuning of dexterous visuomotor policies under limited interaction budgets. It introduces a conditional normalizing-flow policy that models multimodal action chunks with exact likelihoods, paired with an action-chunked critic aligned to the policy's temporal structure. The framework proceeds through four stages (policy imitation initialization, offline critic warm-up, full offline RL, and online RL fine-tuning) and uses conservative regularization to stabilize updates. Real-world experiments on scissors-tape manipulation and in-hand cube rotation demonstrate stable, sample-efficient adaptation where purely imitation-based or simulation-based approaches struggle.

Abstract

Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework built on a normalizing-flow (NF) policy to address these challenges. The normalizing-flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SOFT-FLOW on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp, both of which require precise, dexterous control over long horizons. On these tasks, SOFT-FLOW achieves stable, sample-efficient adaptation where standard methods struggle.
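To make the "exact likelihoods" claim concrete, here is a minimal sketch of a single conditional affine coupling layer in NumPy. It is not the paper's architecture: the linear conditioners `W_s` and `W_t`, the `tanh`-bounded log-scale, the flattened chunk dimension `D`, and all function names are assumptions chosen for illustration. It only shows how an invertible map gives both a tractable log-density (forward direction) and differentiable sampling (reverse direction).

```python
import numpy as np

def coupling_forward(a, obs, W_s, W_t):
    """Map a flattened action chunk a to a latent z; also return log|det J|."""
    d = a.shape[-1] // 2
    a1, a2 = a[..., :d], a[..., d:]
    # Hypothetical linear conditioner on the untouched half and the observation.
    h = np.concatenate([a1, obs], axis=-1)
    s = np.tanh(h @ W_s)          # bounded log-scale
    t = h @ W_t                   # shift
    z2 = a2 * np.exp(s) + t       # affine transform of the second half
    z = np.concatenate([a1, z2], axis=-1)
    return z, s.sum(axis=-1)      # Jacobian is triangular, so log-det = sum(s)

def coupling_inverse(z, obs, W_s, W_t):
    """Map a latent z back to an action chunk (used for sampling)."""
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    h = np.concatenate([z1, obs], axis=-1)
    s = np.tanh(h @ W_s)
    t = h @ W_t
    a2 = (z2 - t) * np.exp(-s)
    return np.concatenate([z1, a2], axis=-1)

def log_prob(a, obs, W_s, W_t):
    """Exact log-likelihood via the change-of-variables formula."""
    z, logdet = coupling_forward(a, obs, W_s, W_t)
    D = z.shape[-1]
    base = -0.5 * (z ** 2).sum(axis=-1) - 0.5 * D * np.log(2.0 * np.pi)
    return base + logdet

# Illustrative sizes (assumed): chunk of 4 steps x 2 action dims, 3-dim observation.
rng = np.random.default_rng(0)
D, obs_dim, batch = 8, 3, 5
W_s = 0.1 * rng.standard_normal((D // 2 + obs_dim, D // 2))
W_t = 0.1 * rng.standard_normal((D // 2 + obs_dim, D // 2))
obs = rng.standard_normal((batch, obs_dim))
a = rng.standard_normal((batch, D))

z, logdet = coupling_forward(a, obs, W_s, W_t)
a_rec = coupling_inverse(z, obs, W_s, W_t)
lp = log_prob(a, obs, W_s, W_t)
```

A full flow would stack several such layers with permutations between them so that every action dimension is transformed; the log-determinants then simply add. A likelihood regularizer of the kind the abstract describes could be built from `log_prob`, though its exact form is not given in this summary.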

Paper Structure (73 sections, 11 equations, 11 figures, 4 tables, 3 algorithms)

Introduction
Related Work
Policy optimization for generative policies
Normalizing flows for RL and IL
Temporal abstraction and action chunking
Offline + online RL
Background
MDP with observations and action chunks
Offline RL and multi-step bootstrapping
Conditional normalizing flows
Method
Setting
Normalizing-flow policy
Action-chunked critic
Algorithm
...and 58 more sections

Figures (11)

Figure 1: Actor--critic architecture of SOFT-FLOW. The actor is a conditional normalizing flow (NF) that models an invertible mapping between action chunks and a base Gaussian distribution. It consists of $K$ stacked NF blocks, each applying affine transformations to a subset of tokens, conditioned on observations. In the forward process, sampled actions are mapped to the base distribution, yielding tractable log-likelihoods used for behavior cloning supervision (\ref{eq:loss_il_nf}). In the reverse direction, latent samples drawn from the Gaussian are transformed into actions through fully differentiable operations, enabling gradient-based policy optimization using the critic. The critic is a transformer-based Q-network that predicts action-chunk values conditioned on observations. Q-values are parameterized using an HL-Gaussian distribution to improve regression stability. To mitigate overestimation bias, the final Q estimate is computed by taking the minimum over multiple critic predictions.
Figure 2: Real-world experimental setup. Left: scissors retrieval and tape cutting with a 7-DoF Franka Panda arm and ORCA hand, using two wrist-mounted RGB cameras and one external workspace camera. Right: in-hand cube reorientation with the ORCA hand, performing continuous palm-down rotations, using single-camera vision-based pose estimation.
Figure 3: Performance evolution on the scissors task with respect to the number of collected demonstrations. We collect 121 teleoperated demonstrations, of which 71 are successful. The policy is first initialized via imitation learning on successful trajectories, followed by offline RL fine-tuning using the full dataset. Subsequently, we iteratively collect 10 online rollouts and fine-tune the policy in an online RL setting. We report the success rates of grasping the scissors (blue) and cutting the tape (orange), measured as the number of successful executions out of 10 rollouts from a fixed test configuration. Offline RL substantially improves the grasp success rate, while online RL gradually increases the overall task success.
Figure 4: Performance evolution on the cube rotation task with respect to the real-world data collected during online fine-tuning of SOFT-FLOW. We report rotations per minute (RPM, blue) and cumulative rotation per trajectory (orange). The policy is initialized via teacher–student distillation in large-scale simulation, trained for 20M steps, and subsequently deployed and fine-tuned online in the real world using SOFT-FLOW. For every 1,000 gradient updates, approximately 15 minutes of real-world interaction data are collected. The first 3,000 updates are used for critic warm-up, after which actor fine-tuning begins. A temporary performance drop is observed at the onset of actor fine-tuning due to increased policy exploration, followed by significant performance improvements as training progresses.
Figure 5: Qualitative rollouts of SOFT-FLOW on real hardware. Top: scissors retrieval and tape cutting task, showing grasp acquisition, lifting, and successful cutting. Bottom: in-hand cube rotation task, illustrating stable grasp maintenance and continuous rotation over time.
...and 6 more figures
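The critic description in the Figure 1 caption (HL-Gaussian value parameterization, minimum over multiple critic predictions) can be illustrated with a small NumPy sketch. Everything specific here is an assumption rather than the paper's formulation: the bin range, the smoothing width `sigma`, the chunk length, the discounted chunk-level target in `chunk_target`, and all function names are hypothetical. The HL-Gaussian step follows the general idea of turning a scalar target into a categorical distribution by integrating a Gaussian over fixed bins.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function without SciPy

def hl_gauss_target(y, edges, sigma):
    """Convert scalar targets y into categorical probabilities over value bins."""
    y = np.asarray(y, dtype=float)[..., None]
    # Gaussian CDF centered at y, evaluated at every bin edge.
    cdf = 0.5 * (1.0 + _erf((edges - y) / (sigma * math.sqrt(2.0))))
    probs = cdf[..., 1:] - cdf[..., :-1]
    # Renormalize the mass that falls inside the supported range.
    return probs / probs.sum(axis=-1, keepdims=True)

def expected_q(probs, edges):
    """Recover a scalar Q-value as the expectation over bin centers."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    return (probs * centers).sum(axis=-1)

def chunk_target(rewards, gamma, q_next_ensemble, done):
    """Assumed chunk-level target: discounted in-chunk rewards plus a
    bootstrap from the minimum over the ensemble's next-chunk Q estimates."""
    H = len(rewards)
    discounts = gamma ** np.arange(H)
    ret = float((discounts * rewards).sum())
    bootstrap = 0.0 if done else (gamma ** H) * float(np.min(q_next_ensemble))
    return ret + bootstrap

# Illustrative numbers (assumed): 4-step chunk, two critics, 50 bins on [0, 10].
edges = np.linspace(0.0, 10.0, 51)
rewards = np.array([0.0, 0.0, 1.0, 0.0])
y = chunk_target(rewards, gamma=0.99, q_next_ensemble=np.array([5.0, 4.0]), done=False)
probs = hl_gauss_target(y, edges, sigma=0.15)
q_hat = expected_q(probs, edges)
```

In training, each critic would output logits over the same bins and be fit to `probs` with a cross-entropy loss; `q_hat` shows that the scalar value is recoverable from the categorical target as long as the target sits well inside the bin range.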
