Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Aidan Curtis, Eric Li, Michael Noseworthy, Nishad Gothoskar, Sachin Chitta, Hui Li, Leslie Pack Kaelbling, Nicole Carey

TL;DR

This work targets the robustness gap in sim-to-real robotic learning by learning the domain-randomization (DR) distribution itself. The proposed method, GoFlow, trains a normalizing-flow neural sampler with an entropy-regularized objective so that the policy performs well across a broad range of environments. The learned sampler is more flexible and expressive than fixed or simple parametric DR distributions and achieves better domain coverage in six simulated domains and a real gear-insertion task. The learned distributions are also integrated into a belief-space planning framework, where a privileged value function is used to detect out-of-distribution states and guide information gathering for long-horizon manipulation under partial observability. The results suggest that GoFlow can improve sim-to-real transfer and enable risk-aware, multi-step planning in complex robotics tasks, while acknowledging training variance and the need for careful threshold tuning.

Abstract

Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies trained in simulation. By randomizing environment properties during training, the learned policy can become robust to uncertainties along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate automatically discovering a sampling distribution via entropy-regularized reward maximization of a normalizing-flow-based neural sampling distribution. We show that this architecture is more flexible and provides greater robustness than existing approaches that learn simpler, parameterized sampling distributions, as demonstrated in six simulated and one real-world robotics domain. Lastly, we explore how these learned sampling distributions, combined with a privileged value function, can be used for out-of-distribution detection in an uncertainty-aware multi-step manipulation planner.
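
To make "entropy-regularized reward maximization" concrete, the sketch below evaluates an objective of the form E_p[R] + alpha * H(p) for a sampling distribution p over environment parameters. It is only an illustration under loud assumptions: the normalizing flow is swapped for a categorical distribution over a discretized 1D parameter grid, the bimodal `reward` is a made-up stand-in for policy return, and `alpha`, `entropy_regularized_objective`, and `optimal_sampler` are hypothetical names. The paper's actual loss and architecture may differ.

```python
import numpy as np


def entropy_regularized_objective(p, reward, alpha):
    """Return E_p[reward] + alpha * H(p) for a categorical distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # treat 0 * log(0) as 0
    entropy = -np.sum(p[nz] * np.log(p[nz]))
    return float(np.dot(p, reward) + alpha * entropy)


def optimal_sampler(reward, alpha):
    """Closed-form maximizer over the simplex: softmax(reward / alpha)."""
    z = np.asarray(reward, dtype=float) / alpha
    z = z - z.max()  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()


# Hypothetical toy problem: a bimodal "policy return" over one parameter xi.
xi = np.linspace(-1.0, 1.0, 41)
reward = np.exp(-((xi - 0.5) ** 2) / 0.02) + np.exp(-((xi + 0.5) ** 2) / 0.02)
alpha = 0.2  # assumed entropy weight

p_star = optimal_sampler(reward, alpha)

# Two simple alternatives: sample everything equally, or only the best bin.
uniform = np.full(xi.size, 1.0 / xi.size)
greedy = np.zeros(xi.size)
greedy[np.argmax(reward)] = 1.0

obj_star = entropy_regularized_objective(p_star, reward, alpha)
obj_uniform = entropy_regularized_objective(uniform, reward, alpha)
obj_greedy = entropy_regularized_objective(greedy, reward, alpha)
```

In this toy setting the entropy term keeps probability mass on both reward modes rather than collapsing onto a single best parameter value, which is the qualitative behavior a more expressive sampler such as a flow is meant to capture in higher dimensions.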

Paper Structure

This paper contains 34 sections, 20 equations, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: An architecture diagram for our actor-critic RL training setup using a normalizing flow to seed environment parameters across episodes.
  • Figure 2: An illustrative domain showing the learned sampling functions over the space of unobserved parameters for the tested baselines. Compared to other learning methods, GoFlow correctly models the multimodality and inter-variable dependencies of the underlying reward function. This toy domain, along with other domains in our experiments, violates some of the assumptions made by prior works, such as the feasibility of the center point of the range.
  • Figure 3: The coverage ratio over the target distribution across five random seeds for each of the environments. The bands around each curve indicate the standard error.
  • Figure 4: A multi-step manipulation plan using probabilistic pose estimation to estimate and update beliefs over time. The three rows show the robot state $s_t$, the observation $o_t$, and the robot belief $b_t$ at each timestep. The red dotted line in the belief indicates the marginal entropy thresholds for the x, y, and yaw (rotation around z) dimensions as determined by the learned normalizing flow. A belief with entropy surpassing the threshold line indicates the policy will likely fail. For full visualizations of the belief posteriors, flow distributions, and value maps, see Figure \ref{fig:belief_posteriors}.
  • Figure 5: A visual example of the precondition computation described in Section \ref{sec:computing_preconditions} for the gear assembly plan shown in Figure 4. The two rows show two different projections of the 3D sampling space (x position vs. y position in the top row and y position vs. yaw rotation in the bottom row). We apply a threshold $\epsilon$ to the sampling distribution to remove low-probability regions (column 1). Additionally, we filter the value function by retaining only the regions where the expected value exceeds a predetermined threshold $\eta$ (column 2). The intersection of these two regions defines the belief-space precondition, indicating where the policy is likely to succeed (column 3). Comparing the precondition to the beliefs, we can see that the belief is not sufficiently contained within the precondition at $t=0$ (column 4), but passes the success threshold $\eta$ after closer inspection at $t=4$ (column 5).
  • ...and 9 more figures
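
The precondition computation described in the Figure 5 caption can be sketched on a discretized grid as follows. This is a minimal illustration rather than the paper's implementation: the arrays are invented toy values on a 1D grid, and `precondition_mask`, `belief_mass_in_precondition`, and the containment threshold `min_mass` are hypothetical names introduced here. In particular, scoring "sufficiently contained" by the belief mass inside the precondition is an assumption.

```python
import numpy as np


def precondition_mask(sampler_density, value, eps, eta):
    """Cells where the sampler density exceeds eps AND the value exceeds eta."""
    return (np.asarray(sampler_density) > eps) & (np.asarray(value) > eta)


def belief_mass_in_precondition(belief, mask):
    """Fraction of (normalized) belief mass that lies inside the mask."""
    belief = np.asarray(belief, dtype=float)
    belief = belief / belief.sum()
    return float(belief[mask].sum())


# Toy 1D grid with five cells (all values are made up for illustration).
density = np.array([0.01, 0.20, 0.40, 0.30, 0.09])  # learned sampling density
value = np.array([0.90, 0.95, 0.80, 0.30, 0.90])    # privileged value estimate
eps, eta = 0.05, 0.5

mask = precondition_mask(density, value, eps, eta)

# A diffuse belief (before inspection) and a sharper one (after inspection).
wide_belief = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
narrow_belief = np.array([0.0, 0.1, 0.8, 0.1, 0.0])

min_mass = 0.8  # hypothetical containment threshold
wide_mass = belief_mass_in_precondition(wide_belief, mask)
narrow_mass = belief_mass_in_precondition(narrow_belief, mask)
wide_ok = wide_mass >= min_mass
narrow_ok = narrow_mass >= min_mass
```

Under these toy numbers the diffuse belief places too much mass outside the precondition, while the sharper belief clears the assumed threshold, mirroring the caption's contrast between the beliefs at $t=0$ and $t=4$.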

Theorems & Definitions (1)

  • Remark 1.1