Normalizing Flows are Capable Models for Bi-manual Visuomotor Policy

Jialong Li, Simon Kristoffersson Lind, Wenrui Xie, Maj Stenmark, Volker Krüger

TL;DR

Normalizing Flows Policy (NF-P) is a conditional normalizing-flow visuomotor policy for bi-manual manipulation. It learns a conditional density over action sequences, which enables single-pass generative sampling with tractable likelihood computation.

Abstract

The field of general-purpose robotics has recently embraced powerful probabilistic diffusion-based models to learn complex embodiment behaviours. However, existing models often come with significant trade-offs, namely high computational cost at inference and a fundamental inability to quantify output uncertainty. We introduce Normalizing Flows Policy (NF-P), a conditional normalizing-flow visuomotor policy for bi-manual manipulation. NF-P learns a conditional density over action sequences and enables single-pass generative sampling with tractable likelihood computation. Using this property, we propose two inference-time optimization strategies: Stochastic Batch Selection, which selects the highest-likelihood trajectory among sampled candidates, and Gradient Refinement, which directly ascends the log-likelihood to improve action quality. In both simulation and real-robot experiments, NF-P achieves promising success rates compared to the baseline. In addition to improved task performance, NF-P demonstrates faster training and lower inference latency. These results establish normalizing flows as a competitive and computationally efficient choice of visuomotor policy, particularly for real-time, uncertainty-aware robotic control.
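The two inference-time strategies named in the abstract can be illustrated with a minimal sketch. It assumes a toy elementwise affine flow, x = mu(o) + sigma(o) * z, in place of the paper's actual architecture; the conditioner `condition`, the step size `lr`, and the number of steps are hypothetical choices made only so that the example runs with the Python standard library.

```python
import math
import random

# Toy conditional flow (an illustrative assumption, not NF-P's architecture):
# x = mu(o) + sigma(o) * z with z ~ N(0, I), applied elementwise.

def condition(obs):
    """Hypothetical conditioner: maps an observation to a per-dimension shift and scale."""
    mu = [0.5 * v for v in obs]
    sigma = [1.0 + 0.1 * abs(v) for v in obs]
    return mu, sigma

def sample(obs, rng):
    """Single-pass sampling: draw z from the prior and apply the inverse map T^{-1}(z | o)."""
    mu, sigma = condition(obs)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def log_prob(x, obs):
    """Exact log p(x | o) from the change-of-variables formula for this affine flow."""
    mu, sigma = condition(obs)
    total = 0.0
    for xi, m, s in zip(x, mu, sigma):
        z = (xi - m) / s
        # Standard-normal log-density of z minus the log-determinant term log(s).
        total += -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(s)
    return total

def grad_log_prob(x, obs):
    """Closed-form gradient of log p(x | o) with respect to x for the toy flow."""
    mu, sigma = condition(obs)
    return [-(xi - m) / (s * s) for xi, m, s in zip(x, mu, sigma)]

def stochastic_batch_selection(obs, n, rng):
    """Sample n candidates and keep the one with the highest exact log-likelihood."""
    candidates = [sample(obs, rng) for _ in range(n)]
    return max(candidates, key=lambda x: log_prob(x, obs))

def gradient_refinement(x, obs, steps=5, lr=0.1):
    """Take a few gradient-ascent steps on log p(x | o), starting from a proposal x."""
    for _ in range(steps):
        g = grad_log_prob(x, obs)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x
```

In a learned flow the gradient would typically come from automatic differentiation rather than a closed form; the closed form here exists only because the toy flow is affine.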

Paper Structure

This paper contains 24 sections, 7 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the Normalizing Flows Policy. During training, the conditional normalizing flow learns a bijective mapping from the complex manifold of robot actions to a simple Gaussian prior $\mathbf{z}$, conditioned on visual observations $\mathbf{o}$. At inference, the learned transformation is applied in the inverse direction, $T^{-1}(\mathbf{z}|\mathbf{o})$, to generate actions. A key advantage of normalizing flows is the ability to query the exact likelihood of generated actions, enabling two efficient inference optimisation strategies: 1) Stochastic Batch Selection: generate a batch of candidates ($\mathbf{x}_{1 \dots N}$) and select the sample with the highest likelihood, and 2) Gradient Refinement: refine a proposal trajectory by ascending the exact log-likelihood gradient $\nabla_{\mathbf{x}} \log p(\mathbf{x}|\mathbf{o})$ for a small number of steps. The optimised trajectory is then executed by the robot.
  • Figure 2: For a training sample $i$ anchored at the current time step $t=6$, we extract a history of observations (blue) and a future trajectory of actions (yellow) using a temporal stride $s$ (here $s=2$, with the observation and prediction horizons both set to 4). The sampling window slides by a single frame (shift $=1$) for the next sample ($t=7$). This strided sampling ensures that although each sample is internally sparse (low frequency), dataset coverage remains dense, allowing the model to initiate plans from any state.
  • Figure 3: Visualization of four example simulation tasks: Stack Block Two, Lift Pot, Beat Block Hammer, and Handover Block.
  • Figure 4: Summary of success rates on the full RoboTwin 2.0 evaluation suite. Bars show the success rate for 100 trials across all tasks for each method (DP, NF-P$_{\text{GR}}$ and NF-P$_{\text{SBS}}$). Normalizing flow policies consistently outperform the diffusion-policy baseline.
  • Figure 5: Real-world experimental setup and task progression. (Left) The dual-arm manipulation environment for the Stack Block Two (top row) and Towel Folding (bottom row) tasks. The wrist and head cameras are not used for either task; the red circle marks the single scene camera that provides the visual observation input for the models. (Right) Key terminal stages of the tasks, viewed from the perspective of the scene camera. These stages correspond to the bottleneck evaluation milestones (e.g., picking, placing, and folding).
  • ...and 1 more figures
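
The strided windowing described in the Figure 2 caption can be sketched as index arithmetic. This is a minimal illustration under stated assumptions: the function names are hypothetical, the observation window is assumed to end at the anchor $t$, and the action window is assumed to start at $t + s$ (the caption does not say whether the first predicted action sits at $t$ or $t + s$).

```python
def strided_window(t, stride, obs_horizon, pred_horizon):
    """Return (observation indices, action indices) for a sample anchored at frame t."""
    # Observation history: obs_horizon frames ending at t, spaced `stride` apart.
    obs_idx = [t - k * stride for k in range(obs_horizon - 1, -1, -1)]
    # Future actions: pred_horizon frames after t, spaced `stride` apart
    # (assumption: the first action index is t + stride).
    act_idx = [t + k * stride for k in range(1, pred_horizon + 1)]
    return obs_idx, act_idx

def build_samples(episode_len, stride, obs_horizon, pred_horizon):
    """Slide the anchor by one frame (shift = 1) over every valid position in an episode."""
    first = (obs_horizon - 1) * stride              # earliest anchor with a full history
    last = episode_len - 1 - pred_horizon * stride  # latest anchor with a full future
    return [strided_window(t, stride, obs_horizon, pred_horizon)
            for t in range(first, last + 1)]
```

With the caption's settings, `strided_window(6, 2, 4, 4)` gives observation indices `[0, 2, 4, 6]` and, under the start-at-$t+s$ assumption, action indices `[8, 10, 12, 14]`. Moving the anchor to $t=7$ shifts every index by one frame, which is how coverage stays dense while each sample stays sparse.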