Learning on One Mode: Addressing Multi-modality in Offline Reinforcement Learning
Mianchu Wang, Yue Jin, Giovanni Montana
TL;DR
Offline RL often struggles with distribution shift and multi-modal data. The paper proposes Learning on One Mode (LOM), which models the offline behaviour policy with a Gaussian mixture model and uses a hyper-Markov decision process with a hyper Q-function to select the most promising mode for each state, followed by weighted imitation learning on that mode. The approach comes with policy-improvement guarantees and outperforms existing methods on the D4RL benchmarks, especially in highly multi-modal regimes. Because it avoids modelling the full multi-modal distribution, LOM offers a simple strategy for leveraging diverse offline data.
Abstract
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose weighted imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.
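The mode-selection idea can be illustrated with a toy, single-state sketch. It is not the authors' implementation: the two-component mixture, the `toy_return` function, the Monte Carlo estimate standing in for the hyper Q-function, and the exponential advantage weighting are all assumptions chosen for illustration. It compares the action obtained by averaging over both modes with the action obtained by picking the mode with the higher estimated return and then doing weighted imitation on that mode alone.

```python
import numpy as np

def select_mode(q_per_mode):
    """Return the index of the mode with the highest estimated return."""
    return int(np.argmax(q_per_mode))

def weighted_imitation_mean(actions, advantages, temperature=1.0):
    """Advantage-weighted mean action (weighted MLE of a Gaussian mean).

    Exponential weighting is an assumed, commonly used form of weighted
    imitation; the source does not specify the exact weighting.
    """
    w = np.exp((advantages - advantages.max()) / temperature)
    return float(np.sum(w * actions) / np.sum(w))

def toy_return(a):
    """Hypothetical return: high near a=+1, lower near a=-1, near zero at a=0."""
    return np.exp(-(a - 1.0) ** 2 / 0.1) + 0.5 * np.exp(-(a + 1.0) ** 2 / 0.1)

rng = np.random.default_rng(0)

# Assumed behaviour policy for one state: a two-component Gaussian mixture.
mode_weights = np.array([0.5, 0.5])
mode_means = np.array([-1.0, 1.0])
mode_stds = np.array([0.1, 0.1])

# Monte Carlo estimate of each mode's expected return
# (a stand-in for the hyper Q-function described in the text).
samples = [rng.normal(m, s, size=1000) for m, s in zip(mode_means, mode_stds)]
q_per_mode = np.array([toy_return(s).mean() for s in samples])
best = select_mode(q_per_mode)

# Unimodal fit to the whole mixture: the mean lands between the two modes.
naive_action = float(np.sum(mode_weights * mode_means))

# Weighted imitation restricted to the selected mode.
selected = samples[best]
advantages = toy_return(selected) - q_per_mode[best]
lom_action = weighted_imitation_mean(selected, advantages)

print("selected mode:", best)
print("naive action:", naive_action, "return:", float(toy_return(naive_action)))
print("one-mode action:", lom_action, "return:", float(toy_return(lom_action)))
```

In this toy setting the averaged action falls between the two modes, where the assumed return is close to zero, while the one-mode action stays near the higher-return mode. A full method would condition the mixture and the mode values on the state and learn them from data.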
