Improving Controller Generalization with Dimensionless Markov Decision Processes

Valentin Charvet, Sebastian Stein, Roderick Murray-Smith

TL;DR

The paper tackles distribution-shift generalization in RL controllers by introducing a dimensionless framework, the $\Pi$-MDP, that non-dimensionalizes state and action spaces via the Buckingham-$\Pi$ theorem to achieve equivariance to context changes. It couples a Gaussian Process-based world model with a dimensionless policy search (Pi-PILCO), preserving data efficiency while improving transfer across contexts. The approach is evaluated on pendulum and cartpole systems, showing zero-shot robustness to perturbations in context such as mass and length. The work highlights the value of physics-informed priors for robust RL and outlines practical limitations related to measuring perturbation variables and context dimensionality.

Abstract

Controllers trained with Reinforcement Learning tend to be very specialized and thus generalize poorly when their testing environment differs from their training one. We propose a Model-Based approach to increase generalization where both world model and policy are trained in a dimensionless state-action space. To do so, we introduce the Dimensionless Markov Decision Process ($\Pi$-MDP): an extension of Contextual-MDPs in which state and action spaces are non-dimensionalized with the Buckingham-$\Pi$ theorem. This procedure induces policies that are equivariant with respect to changes in the context of the underlying dynamics. We provide a generic framework for this approach and apply it to a model-based policy search algorithm using Gaussian Process models. We demonstrate the applicability of our method on simulated actuated pendulum and cartpole systems, where policies trained on a single environment are robust to shifts in the distribution of the context.
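To make the idea of a dimensionless state-action space concrete, here is a minimal sketch for a frictionless actuated pendulum. It is not the paper's implementation. The choice of $\Pi$-groups (angular velocity scaled by $\sqrt{L/g}$, torque scaled by $MgL$), the explicit Euler integrator, and all function names are assumptions made for illustration only. The sketch shows that two pendulums with different mass and length, started from the same dimensionless state and given the same dimensionless action, reach the same dimensionless next state.

```python
import numpy as np

G = 9.81  # gravitational acceleration [m/s^2]; assumed identical in every context


def to_dimensionless(theta, omega, torque, mass, length, g=G):
    """Map physical (angle, angular velocity, torque) to assumed Pi-groups."""
    t_scale = np.sqrt(length / g)  # characteristic time of the pendulum
    return theta, omega * t_scale, torque / (mass * g * length)


def from_dimensionless(theta, omega_pi, torque_pi, mass, length, g=G):
    """Inverse of to_dimensionless for a given context (mass, length)."""
    t_scale = np.sqrt(length / g)
    return theta, omega_pi / t_scale, torque_pi * mass * g * length


def pendulum_step(theta, omega, torque, mass, length, dt, g=G):
    """One explicit Euler step of a frictionless point-mass pendulum."""
    alpha = -(g / length) * np.sin(theta) + torque / (mass * length**2)
    return theta + dt * omega, omega + dt * alpha


def dimensionless_step(theta, omega_pi, torque_pi, mass, length, dtau, g=G):
    """Step the physical system, but read and write dimensionless quantities."""
    th, om, u = from_dimensionless(theta, omega_pi, torque_pi, mass, length, g)
    dt = dtau * np.sqrt(length / g)  # dimensionless time step -> seconds
    th_next, om_next = pendulum_step(th, om, u, mass, length, dt, g)
    th_pi, om_pi, _ = to_dimensionless(th_next, om_next, u, mass, length, g)
    return th_pi, om_pi


# Same dimensionless state and action, two different contexts (mass, length).
a = dimensionless_step(0.3, -0.5, 0.2, mass=1.0, length=0.5, dtau=0.05)
b = dimensionless_step(0.3, -0.5, 0.2, mass=3.0, length=2.0, dtau=0.05)
print(np.allclose(a, b))
```

Under these assumptions the two transitions agree up to floating-point error, which is the sense in which a policy acting on dimensionless inputs can be reused across contexts without retraining. Note that the conversion requires the context values (here mass and length) to be known or measured at test time.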

Paper Structure

This paper contains 20 sections, 42 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: High level view of an agent interacting with its environment.
  • Figure 2: Interaction within a $\Pi$-MDP
  • Figure 3: Pendulum success rates on the pole length when both $M$ and $L$ are varying for the natural (left) and dimensionless (right) controllers. Brighter values indicate higher success rates.
  • Figure 4: Cartpole success rates when both parameters $L$ and $M$ change simultaneously. We can see how the dimensionless controller (right) can solve the task on a much wider set of context pairs.
  • Figure 5: Control area (Eq. \ref{eq:control_surface_eq}) pole length and mass. Higher values on the $x$-axis indicate better generalization.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Remark 1
  • Definition 1: $\Pi$-MDP