Zero-Shot Policy Transfer in Reinforcement Learning using Buckingham's Pi Theorem

Francisco Pascoa, Ian Lalonde, Alexandre Girard

TL;DR

The paper tackles the challenge of generalizing reinforcement learning policies across robots, tasks, and environments with differing physical parameters. It introduces a zero-shot transfer method based on Buckingham's Pi Theorem to map observations and actions into a dimensionless space defined by a basis $\beta$, enabling policy scaling from a source context $\mathcal{C}_0$ to target contexts $\mathcal{C}_t$ without retraining. The approach is validated on three environments—simulated pendulum, a real pendulum (sim-to-real), and HalfCheetah—showing that the scaled policy matches the original in dynamically similar contexts and outperforms naive transfers in non-similar contexts, while providing a practical initial guess for further training. These results demonstrate that dimensional analysis can robustly enhance RL policy generalization in robotics, reducing data needs and increasing the operable context volume for a given policy. The work lays a foundation for applying dimensionless policy transfer to more complex systems and for developing context estimators to further improve real-world robustness.

Abstract

Reinforcement learning (RL) policies often fail to generalize to new robots, tasks, or environments with different physical parameters, a challenge that limits their real-world applicability. This paper presents a simple, zero-shot transfer method based on Buckingham's Pi Theorem to address this limitation. The method adapts a pre-trained policy to new system contexts by scaling its inputs (observations) and outputs (actions) through a dimensionless space, requiring no retraining. The approach is evaluated against a naive transfer baseline across three environments of increasing complexity: a simulated pendulum, a physical pendulum for sim-to-real validation, and the high-dimensional HalfCheetah. Results demonstrate that the scaled transfer exhibits no loss of performance on dynamically similar contexts. Furthermore, on non-similar contexts, the scaled policy consistently outperforms the naive transfer, significantly expanding the volume of contexts where the original policy remains effective. These findings demonstrate that dimensional analysis provides a powerful and practical tool to enhance the robustness and generalization of RL policies.
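The scaling step described in the abstract can be sketched for the pendulum case with the basis $\beta = \{m, l, g\}$. This is a minimal illustration under stated assumptions, not the authors' implementation: the observation is assumed to be (angle, angular velocity), the action a torque, and the names `Context`, `obs_scales`, `action_scale`, and `scaled_policy` are hypothetical. The characteristic units follow from dimensional analysis: $\sqrt{g/l}$ for angular velocity and $m g l$ for torque.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Context:
    """Physical parameters forming the basis {m, l, g} (hypothetical container)."""
    m: float  # mass [kg]
    l: float  # length [m]
    g: float  # gravitational acceleration [m/s^2]


def obs_scales(c: Context) -> Tuple[float, float]:
    # Assumed observation: (theta [rad], theta_dot [rad/s]).
    # Angles are already dimensionless; angular velocity has units of
    # 1/time, and the characteristic time of the basis is sqrt(l/g).
    return (1.0, math.sqrt(c.g / c.l))


def action_scale(c: Context) -> float:
    # Assumed action: torque [N*m], whose characteristic unit is m*g*l.
    return c.m * c.g * c.l


def scaled_policy(
    policy: Callable[[Sequence[float]], float],
    source: Context,
    target: Context,
) -> Callable[[Sequence[float]], float]:
    """Wrap a policy trained in `source` so it can be queried in `target`."""
    src_obs, tgt_obs = obs_scales(source), obs_scales(target)

    def wrapped(obs: Sequence[float]) -> float:
        # Target units -> dimensionless -> source units.
        obs_in_source = [o / t * s for o, t, s in zip(obs, tgt_obs, src_obs)]
        action_in_source = policy(obs_in_source)
        # Source units -> dimensionless -> target units.
        return action_in_source / action_scale(source) * action_scale(target)

    return wrapped


# Toy stand-in for a trained policy (a PD law, purely illustrative).
def toy_policy(obs: Sequence[float]) -> float:
    return -2.0 * obs[0] - 0.5 * obs[1]


source = Context(m=1.0, l=1.0, g=10.0)
target = Context(m=2.0, l=4.0, g=10.0)
transferred = scaled_policy(toy_policy, source, target)
torque = transferred((0.1, 1.0))
```

The wrapped policy never retrains: the original policy is queried in its own units and only the inputs and outputs are rescaled. A naive transfer, by contrast, would call `toy_policy` directly on the target observation.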

Paper Structure

This paper contains 13 sections, 18 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The scaled policy yields a higher total reward on more contexts than the naive policy on the simulated pendulums. Each dot represents the total reward of the policy in the given context for scaled and naive transfer. The red star marks the original context. The basis used is $\beta = \{m, l, g\}$.
  • Figure 2: The scaled policy remains optimal on the diagonal where similar contexts to the original lie. Each dot represents the total reward of the policy in the given context for scaled and naive transfer. The red star marks the original context. The basis used is $\beta = \{m, l, g\}$.
  • Figure 3: The scaled policy yields a higher total reward on more contexts than the naive policy on a real pendulum. Each dot represents the total reward of the policy in the given context for scaled and naive transfer. The red star marks the original context. Contexts on the main diagonal (front lower left to back upper right) are similar to the original. The basis used is $\beta = \{m, l, g\}$.
  • Figure 4: The scaled policy yields a higher total reward on more contexts than the naive transfer on the HalfCheetah. Each dot represents the total reward of the policy in the given context for scaled and naive transfer. The red star marks the original context. Contexts on the diagonal are perfectly similar to the original. Only the body length $L$ is not similar. $l_0$ is the back thigh length. The basis used is $\beta = \{m, l_0, g\}$.
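
The "similar" contexts mentioned in the captions are those whose dimensionless parameters match the original's. The check can be sketched as follows. This is an illustration under assumptions, not the paper's code: the summary above does not list the full context, so a torque limit `tau_max` is assumed here as the one context variable outside the basis $\beta = \{m, l, g\}$, and the function names are hypothetical.

```python
import math
from typing import Dict, Tuple


def pi_groups(m: float, l: float, g: float, tau_max: float) -> Tuple[float, ...]:
    """Dimensionless context for the basis {m, l, g}.

    tau_max (torque limit, N*m) is an ASSUMED extra context variable;
    dividing by the characteristic torque m*g*l makes it dimensionless.
    """
    return (tau_max / (m * g * l),)


def dynamically_similar(a: Dict[str, float], b: Dict[str, float],
                        rel_tol: float = 1e-9) -> bool:
    # Two contexts are similar when every dimensionless group matches.
    return all(
        math.isclose(x, y, rel_tol=rel_tol)
        for x, y in zip(pi_groups(**a), pi_groups(**b))
    )


original = {"m": 1.0, "l": 1.0, "g": 10.0, "tau_max": 5.0}
on_diagonal = {"m": 2.0, "l": 2.0, "g": 10.0, "tau_max": 20.0}
off_diagonal = {"m": 2.0, "l": 2.0, "g": 10.0, "tau_max": 5.0}

similar = dynamically_similar(original, on_diagonal)
not_similar = dynamically_similar(original, off_diagonal)
```

For HalfCheetah, the analogous group suggested by the Figure 4 caption would be the ratio of body length to back thigh length, $L / l_0$.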