Design of Restricted Normalizing Flow towards Arbitrary Stochastic Policy with Computational Efficiency
Taisuke Kobayashi, Takumi Aotani
TL;DR
This work tackles the challenge of building expressive yet reliable stochastic policies for real-time RL by introducing restricted normalizing flows (RNF) that yield an analytic policy mean. It shows that constraining the base distribution to be symmetric and the transform to be odd enables analytic mean computation with minimal computational overhead, so the policy can be deployed reliably. To recover the expressiveness lost to these restrictions, the authors propose Bit-RNF, an RNF with a bimodal Student-t base that supports asymmetric and heavy-tailed behaviors. Empirical results across simulated benchmarks and a real ball-plate manipulation task demonstrate that Bit-RNF achieves faster, more stable learning than baselines and operates within real-time constraints, validating its practical potential for robotic control.
Abstract
This paper proposes a new design method for a stochastic control policy using a normalizing flow (NF). In reinforcement learning (RL), the policy is usually modeled as a distribution with trainable parameters. When this parameterization is not expressive enough, it can fail to acquire the optimal policy. A mixture model is capable of universal approximation, but excessive redundancy increases its computational cost, which can become a bottleneck in real-time robot control. As another approach, NF, which adds parameters for an invertible transformation of a simple stochastic model used as a base, is expected to offer high expressiveness at lower computational cost. However, NF cannot compute its mean analytically because of the complexity of the invertible transformation, and it lacks reliability because it retains stochastic behavior after deployment as a robot controller. This paper therefore designs a restricted NF (RNF) that achieves an analytic mean by appropriately restricting the invertible transformation. In addition, the expressiveness impaired by this restriction is regained by using a bimodal Student-t distribution as the base, yielding the so-called Bit-RNF. In RL benchmarks, the Bit-RNF policy outperformed previous models. Finally, a real robot experiment demonstrated the applicability of the Bit-RNF policy to the real world. The attached video is available on YouTube: https://youtu.be/R_GJVZDW9bk
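The idea behind the analytic mean can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a unimodal Student-t base centered at zero (not the bimodal base of Bit-RNF) and a hypothetical odd, invertible transform `odd_transform`; the names `sample_actions` and `analytic_mean` and all parameter values are likewise illustrative. Because the base is symmetric and the transform is odd, the transformed noise has zero mean whenever that mean exists (here, for more than one degree of freedom), so the policy mean is just the location `mu`.

```python
import numpy as np


def odd_transform(u, a=0.8, b=2.0):
    """Hypothetical odd, invertible map (an assumption, not the paper's flow).

    Odd: f(-u) = -f(u). Strictly increasing for a, b >= 0 because
    f'(u) = 1 + a * b / cosh(b * u)**2 > 0.
    """
    return u + a * np.tanh(b * u)


def sample_actions(mu, sigma, n, df=5.0, seed=0):
    """Draw n actions x = mu + sigma * f(eps), with eps symmetric around 0."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_t(df, size=n)  # Student-t noise, symmetric about zero
    return mu + sigma * odd_transform(eps)


def analytic_mean(mu):
    """Mean of the restricted flow: E[f(eps)] = 0 by symmetry, so E[x] = mu."""
    return mu


mu, sigma = 0.3, 0.5
samples = sample_actions(mu, sigma, n=200_000)
mc_mean = samples.mean()  # Monte Carlo estimate, should be close to mu
print(f"analytic mean: {analytic_mean(mu):.3f}, Monte Carlo mean: {mc_mean:.3f}")
```

In this sketch the deployed controller could output `analytic_mean(mu)` directly instead of sampling, which is the kind of deterministic behavior the abstract associates with reliability.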
