Scalable Offline Reinforcement Learning for Mean Field Games

Axel Brunnbauer, Julian Lemmel, Zahra Babaiee, Sophie Neubauer, Radu Grosu

TL;DR

This paper presents Offline Munchausen Mirror Descent (Off-MMD), a novel mean-field RL algorithm that approximates equilibrium policies in mean-field games using purely offline data and incorporates techniques from offline RL to address common issues like Q-value overestimation, ensuring robust policy learning even with limited data coverage.

Abstract

Reinforcement learning algorithms for mean-field games offer a scalable framework for optimizing policies in large populations of interacting agents. Existing methods often depend on online interactions or access to system dynamics, limiting their practicality in real-world scenarios where such interactions are infeasible or difficult to model. In this paper, we present Offline Munchausen Mirror Descent (Off-MMD), a novel mean-field RL algorithm that approximates equilibrium policies in mean-field games using purely offline data. By leveraging iterative mirror descent and importance sampling techniques, Off-MMD estimates the mean-field distribution from static datasets without relying on simulation or environment dynamics. Additionally, we incorporate techniques from offline reinforcement learning to address common issues like Q-value overestimation, ensuring robust policy learning even with limited data coverage. Our algorithm scales to complex environments and demonstrates strong performance on benchmark tasks like crowd exploration or navigation, highlighting its applicability to real-world multi-agent systems where online experimentation is infeasible. We empirically demonstrate the robustness of Off-MMD to low-quality datasets and conduct experiments to investigate its sensitivity to hyperparameter choices.
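To make the idea of estimating a mean-field distribution from a static dataset concrete, the following is a minimal tabular sketch using self-normalized importance sampling. It is an illustration under stated assumptions, not the estimator used by Off-MMD: the function name `estimate_mean_field`, the array layout, and the assumption that the behavior policy `pi_beta` is known and time-independent are all hypothetical choices made for this example.

```python
import numpy as np


def estimate_mean_field(states, actions, pi, pi_beta, n_states):
    """Estimate the state distribution mu_t induced by `pi` from logged data.

    Hypothetical layout (an assumption for this sketch):
      states:  (N, T+1) int array, state visited by trajectory i at step t
      actions: (N, T)   int array, action taken by trajectory i at step t
      pi, pi_beta: (n_states, n_actions) arrays of action probabilities
    Returns an array mu of shape (T+1, n_states).
    """
    n_traj, horizon = actions.shape

    # Per-step likelihood ratios pi(a|s) / pi_beta(a|s) along each trajectory.
    ratios = pi[states[:, :-1], actions] / pi_beta[states[:, :-1], actions]

    # Cumulative weight up to (but excluding) step t; the weight at t=0 is 1.
    weights = np.ones((n_traj, horizon + 1))
    weights[:, 1:] = np.cumprod(ratios, axis=1)

    mu = np.zeros((horizon + 1, n_states))
    for t in range(horizon + 1):
        # Weighted histogram of the states visited at step t.
        mu[t] = np.bincount(states[:, t], weights=weights[:, t], minlength=n_states)
        total = mu[t].sum()
        if total > 0:  # self-normalize; leave zeros if no trajectory has weight
            mu[t] /= total
    return mu
```

When `pi` equals `pi_beta`, every weight is 1 and the estimate reduces to the empirical state histogram of the dataset. When `pi` places mass on actions the dataset rarely contains, few trajectories carry weight and the estimate becomes noisy, which is one way to see why data coverage matters.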

Paper Structure

This paper contains 17 sections, 20 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) Off-MMD can approximate the performance of D-MOMD on the Exploration task when trained on reasonably good datasets. Training runs were conducted over 10 seeds for 100 iterations of Off-MMD and D-MOMD. We report the mean exploitability and the 95% confidence interval. (b) Evolution of the mean-field over timesteps $t$. Darker areas indicate higher state density.
  • Figure 2: Off-MMD performs best with intermediate and expert quality datasets. Compared to the exploration task, the policy trained on the random behavior dataset performs better. Experiment settings are the same as in Figure 1.
  • Figure 3: Exploitability vs. Data-Quality: Each point represents a training run of Off-MMD on a dataset with specific state-action coverage and trajectory quality. The color indicates the exploitability of the policy after 100 iterations with darker colors indicating higher exploitability. For reference, we also mark the datasets used in previous experiments.
  • Figure 4: The left column shows the empirical state distribution of datasets collected by behavior policy $\pi^\beta$ (from the navigation task) with different state coverages. The center column shows the ground-truth mean-field generated by the policy under evaluation, and the right column shows the mean-field of that policy approximated from the dataset alone. The mean-fields are shown at $t=15$. White spots indicate no state-action coverage in this area.
  • Figure 5: We report the exploitability of policies with different values of $\eta$ in the loss function. We train each configuration on 5 datasets that have state-action coverages close to a specific value, ranging from 0.15 to 0.45. We report the mean exploitability of the policies after 100 iterations. For reference, we also plot the performance of the online baseline after 100 iterations.
  • ...and 1 more figure

Theorems & Definitions (4)

  • definition 1
  • definition 2: $\epsilon$-MFNE
  • definition 3: State-Action Coverage
  • definition 4: Trajectory Quality
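
The formal statement of Definition 3 is not reproduced in this listing. As a rough illustration only, one common way to quantify state-action coverage in a tabular setting is the fraction of distinct state-action pairs that appear at least once in the dataset. The sketch below assumes that reading; the function name `state_action_coverage` and its arguments are hypothetical and may differ from the paper's definition.

```python
import numpy as np


def state_action_coverage(states, actions, n_states, n_actions):
    """Fraction of (state, action) pairs that occur at least once in the data.

    Assumed (hypothetical) inputs: integer arrays `states` and `actions` of the
    same shape, holding the logged state and the action taken in that state.
    """
    visited = np.zeros((n_states, n_actions), dtype=bool)
    # Mark every pair that appears in the dataset; duplicates are harmless.
    visited[np.ravel(states), np.ravel(actions)] = True
    return float(visited.mean())
```

Under this reading, a value of 0.15 would mean that 15% of all state-action pairs are present in the dataset, which matches the scale of the coverage values quoted in the Figure 5 caption.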