Equivariant Ensembles and Regularization for Reinforcement Learning in Map-based Path Planning

Mirco Theile, Hongpeng Cao, Marco Caccamo, Alberto L. Sangiovanni-Vincentelli

TL;DR

This work addresses how to exploit environmental symmetries in reinforcement learning without constraining neural network architectures. It introduces equivariant ensembles, which average the outputs of the policy and value networks over the symmetry transformations so that the resulting policy is equivariant and the resulting value function is invariant, and it adds a regularization term that biases the networks toward symmetry during training. The approach is demonstrated on a long-horizon, map-based UAV coverage path planning (CPP) task, where it improves sample efficiency, training speed, and generalization, including to rotated and out-of-distribution maps. The findings suggest that combining ensemble-based symmetry with regularization yields practical benefits for symmetry-rich RL problems and may extend to other domains with similar invariances.

Abstract

In reinforcement learning (RL), exploiting environmental symmetries can significantly enhance efficiency, robustness, and performance. However, ensuring that the deep RL policy and value networks are respectively equivariant and invariant to exploit these symmetries is a substantial challenge. Related works try to design networks that are equivariant and invariant by construction, limiting them to a very restricted library of components, which in turn hampers the expressiveness of the networks. This paper proposes a method to construct equivariant policies and invariant value functions without specialized neural network components, which we term equivariant ensembles. We further add a regularization term that introduces an inductive bias during training. In a map-based path planning case study, we show how equivariant ensembles and regularization benefit sample efficiency and performance.
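
To make the ensemble idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a square grid map, a four-action move set ordered north, west, south, east, and the group of 90-degree rotations as the symmetry group; none of these choices are stated on this page. The toy linear "networks" and all function names are hypothetical. The `symmetry_penalty` function is only one plausible form of a symmetry regularizer (a squared gap between the base policy and the ensemble); the paper's actual regularization term may differ.

```python
import numpy as np

N_ACTIONS = 4      # assumed action order: 0=north, 1=west, 2=south, 3=east
GROUP = range(4)   # assumed symmetry group: rotations by k * 90 degrees (counter-clockwise)


def make_toy_networks(map_size, seed=0):
    """Return an arbitrary, non-equivariant policy and value function (stand-ins for networks)."""
    rng = np.random.default_rng(seed)
    w_pi = rng.normal(size=(map_size * map_size, N_ACTIONS))
    w_v = rng.normal(size=map_size * map_size)

    def policy(obs):
        logits = obs.reshape(-1) @ w_pi
        z = np.exp(logits - logits.max())   # numerically stable softmax
        return z / z.sum()

    def value(obs):
        return float(obs.reshape(-1) @ w_v)

    return policy, value


def transform_state(obs, k):
    """Rotate the map observation counter-clockwise by k * 90 degrees."""
    return np.rot90(obs, k)


def ensemble_policy(policy, obs):
    """Average the policy over all rotations, mapping each output back to the original frame.

    With the assumed action order, rotating the map by k steps sends action a to (a + k) % 4,
    so np.roll(p, -k) undoes that permutation on the probability vector.
    """
    probs = [np.roll(policy(transform_state(obs, k)), -k) for k in GROUP]
    return np.mean(probs, axis=0)


def ensemble_value(value, obs):
    """Average the value estimate over all rotations, which makes it rotation-invariant."""
    return float(np.mean([value(transform_state(obs, k)) for k in GROUP]))


def symmetry_penalty(policy, obs):
    """Illustrative regularizer (an assumption): squared gap between base policy and ensemble."""
    return float(np.mean((policy(obs) - ensemble_policy(policy, obs)) ** 2))


policy, value = make_toy_networks(map_size=5)
obs = np.random.default_rng(1).random((5, 5))
rotated = transform_state(obs, 1)

# Equivariance: rotating the input permutes the ensemble's action probabilities.
print(np.allclose(ensemble_policy(policy, rotated), np.roll(ensemble_policy(policy, obs), 1)))
# Invariance: the ensemble value is unchanged by rotating the input.
print(np.isclose(ensemble_value(value, rotated), ensemble_value(value, obs)))
```

Both properties hold for any base policy and value function, because summing over the whole group absorbs the extra rotation. The cost is one forward pass per group element, which in practice can be batched.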

Paper Structure (24 sections, 27 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1:
  • Figure 2: Example state of a UAV in a coverage path planning grid-world problem on the left, showing the covered area, trajectory, and field of view, with a legend on the right.
  • Figure 3: Neural network architecture where the position is a one-hot representation in the map, and the battery and landing state are fed in as state scalars.
  • Figure 4: All maps used during training and testing.
  • Figure 5: Training curves of all algorithms showing the task-solved ratio throughout the first 20M steps and the average steps to solve the scenarios for the full 100M training steps.

Theorems & Definitions (2)

  • proof
  • proof