SAC-MoE: Reinforcement Learning with Mixture-of-Experts for Control of Hybrid Dynamical Systems with Uncertainty

Leroy D'Souza, Akash Karthikeyan, Yash Vardhan Pant, Sebastian Fischmeister

TL;DR

This work tackles control of hybrid dynamical systems with unobserved latent modes and switching locations by introducing SAC-MoE, which casts the SAC actor as a Mixture-of-Experts with a differentiable router. The MoE actor enables adaptive composition of specialized sub-policies without mode labels, while a curriculum learning scheme prioritizes harder contexts to improve generalization. Empirical studies on autonomous racing and Walker2d locomotion show that SAC-MoE achieves superior zero-shot generalization to unseen contexts, with the router's expert activations qualitatively aligning with latent modes. The approach offers a robust, interpretable framework for hybrid-system control with uncertainty, with potential for extension to few-shot adaptation and latent-context encoding.
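As a rough illustration of what "casting the actor as a Mixture-of-Experts with a differentiable router" can look like, the NumPy sketch below blends several expert heads with softmax gate weights to parameterize a Gaussian action distribution. It is a simplified assumption, not the authors' architecture: the paper's router assigns tokens from an encoder embedding to experts, whereas this sketch gates directly on the observation, and every name and size (`MoEActor`, `W_router`, `W_experts`, `n_experts=3`, the linear experts, the log-std clipping range) is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MoEActor:
    """Illustrative soft-MoE Gaussian actor (hypothetical names and sizes)."""

    def __init__(self, obs_dim, act_dim, n_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        self.act_dim = act_dim
        # Router: linear map from observation to one logit per expert.
        self.W_router = rng.normal(0.0, 0.1, size=(obs_dim, n_experts))
        # Experts: each is a linear map to [mean, log_std] of the action.
        self.W_experts = rng.normal(
            0.0, 0.1, size=(n_experts, obs_dim, 2 * act_dim)
        )

    def forward(self, obs):
        # Soft gate weights, shape (batch, n_experts); each row sums to 1.
        gates = softmax(obs @ self.W_router)
        # Per-expert outputs, shape (batch, n_experts, 2 * act_dim).
        expert_out = np.einsum("bo,eod->bed", obs, self.W_experts)
        # Gate-weighted merge of the expert outputs.
        mixed = np.einsum("be,bed->bd", gates, expert_out)
        mean = mixed[:, : self.act_dim]
        log_std = np.clip(mixed[:, self.act_dim :], -5.0, 2.0)
        return mean, log_std, gates

    def sample(self, obs, rng):
        # Reparameterized sample squashed by tanh, as is common in SAC.
        mean, log_std, gates = self.forward(obs)
        noise = rng.standard_normal(mean.shape)
        action = np.tanh(mean + np.exp(log_std) * noise)
        return action, gates


actor = MoEActor(obs_dim=4, act_dim=2)
rng = np.random.default_rng(1)
obs = rng.standard_normal((5, 4))
action, gates = actor.sample(obs, rng)
```

Because the gate weights are a smooth function of the input, gradients flow through both the router and the experts, and inspecting `gates` along a trajectory is one way to check which experts are active in which mode.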

Abstract

Hybrid dynamical systems result from the interaction of continuous-variable dynamics with discrete events and encompass various systems such as legged robots, vehicles, and aircraft. Challenges arise when the system's modes are characterized by unobservable (latent) parameters and the events that cause the system dynamics to switch between modes are also unobservable. Model-based control approaches typically do not account for such uncertainty in the hybrid dynamics, while standard model-free RL methods fail to account for abrupt mode switches, leading to poor generalization. To overcome this, we propose SAC-MoE, which models the actor of the Soft Actor-Critic (SAC) framework as a Mixture-of-Experts (MoE) with a learned router that adaptively selects among learned experts. To further improve robustness, we develop a curriculum-based training algorithm that prioritizes data collection in challenging settings, allowing better generalization to unseen modes and switching locations. Simulation studies in hybrid autonomous racing and legged locomotion tasks show that SAC-MoE outperforms baselines (by up to 6x) in zero-shot generalization to unseen environments. Our curriculum strategy consistently improves performance across all evaluated policies. Qualitative analysis shows that the interpretable MoE router activates different experts for distinct latent modes.
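The idea of prioritizing data collection in challenging settings can be sketched as a sampling rule over training contexts. The snippet below is only one possible instance, under stated assumptions: difficulty is measured by a running mean of episode returns per context, and contexts are drawn from a softmax over normalized difficulty. The function names, the `temperature` and `alpha` parameters, and the return values are all hypothetical, and the paper's actual curriculum algorithm may differ.

```python
import numpy as np


def context_sampling_probs(mean_returns, temperature=1.0):
    """Lower recent return -> higher sampling probability (illustrative rule)."""
    r = np.asarray(mean_returns, dtype=float)
    spread = r.max() - r.min()
    # Normalize difficulty to [0, 1]; all-equal returns give uniform sampling.
    difficulty = (r.max() - r) / spread if spread > 0 else np.zeros_like(r)
    logits = difficulty / temperature
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()


def update_mean_return(old, new_return, alpha=0.1):
    """Exponential moving average of episode returns for one context."""
    return (1.0 - alpha) * old + alpha * new_return


# Hypothetical running returns for three training contexts.
returns = [100.0, 40.0, 10.0]
probs = context_sampling_probs(returns, temperature=0.5)

rng = np.random.default_rng(0)
next_context = rng.choice(len(returns), p=probs)  # index of context to collect in
```

A lower `temperature` concentrates data collection on the hardest contexts, while a higher one moves the sampler back toward uniform coverage.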

Paper Structure

This paper contains 13 sections, 6 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Vehicle trajectories (right to left) in an environment with different road surfaces. The blue trajectory shows that switching between separate policies for each mode results in failure. The red trajectory is from our proposed method, SAC-MoE, which can successfully handle such transitions.
  • Figure 2: A visualization of how mode parameters and mode location information are combined to generate various contexts that give rise to the context space, $\mathcal{C}$, in Definition 1. Different contexts can lead to different trajectories for a given policy.
  • Figure 3: (a) Episode returns (goal-seeking reward) over 200 runs for policies in a test environment. (b) Visualization of the test environment (mode 1 in red and mode 2 in green) and selected trajectories for $\pi_{\text{sw}}$ and $\pi_\text{opt}$. (c) Visualization of $\pi_{\text{sw}}$ switching between component policies over a trajectory.
  • Figure 4: SAC-MoE Overview. We adopt an actor-critic framework, where the actor is parameterized as a Mixture-of-Experts model. The encoder produces an embedding which is split into tokens. The router mechanism assigns tokens to different experts and merges their outputs to produce the action distribution.
  • Figure 5: Autonomous racing setup. (A) Training racetrack with $\mathcal{C}$, where each uniquely colored set of regions marks the locations of a particular mode (with value from $\mathcal{L} = \{1.0, 0.5, 0.3\}$). (B, C) Evaluation tracks where colored regions represent surfaces with friction values sampled from predefined ranges, yielding diverse test contexts $\mathcal{C}^\text{test}$. Track 2 is out-of-distribution.
  • ...and 2 more figures

Theorems & Definitions (4)

  • Example 1
  • Definition 1
  • Example 2
  • Remark 1