A Recipe for Stable Offline Multi-agent Reinforcement Learning

Dongsu Lee, Daehee Lee, Amy Zhang

TL;DR

This work analyzes the source of instability in non-linear value decomposition within the offline MARL setting and proposes a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point.

Abstract

Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap is the instability of non-linear value decomposition, which has led prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) combined with the value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that non-linear mixing networks induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.
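To make the contrast between linear and non-linear value decomposition concrete, the following is a minimal sketch, not the paper's implementation. It assumes a single-layer mixer with fixed, hand-picked weights (real QMIX-style mixers generate state-dependent weights with hypernetworks), and the function names `vdn_mix` and `monotonic_mix` are hypothetical. It shows only one general fact: the monotonicity constraint fixes the sign of the mixing weights, not their magnitude, so the scale of the joint value is free to change.

```python
# Illustrative sketch only: function names and weights are hypothetical,
# not taken from the paper.

def vdn_mix(agent_qs):
    """Linear decomposition (VDN): Q_tot is the plain sum of per-agent Q values."""
    return sum(agent_qs)


def monotonic_mix(agent_qs, raw_weights, bias=0.0):
    """Simplified single-layer monotonic mixer.

    Taking abs() of the raw weights keeps every mixing weight non-negative,
    so Q_tot never decreases when any single agent's Q value increases.
    """
    weights = [abs(w) for w in raw_weights]
    return sum(w * q for w, q in zip(weights, agent_qs)) + bias


agent_qs = [1.0, 2.0]

# VDN has no free scale: the joint value is tied to the per-agent values.
q_vdn = vdn_mix(agent_qs)

# Both mixers below satisfy the monotonicity constraint, yet their joint
# values differ by a factor of 10 for identical per-agent inputs.
q_mix_small = monotonic_mix(agent_qs, [1.0, -1.0])
q_mix_large = monotonic_mix(agent_qs, [10.0, -10.0])

print(q_vdn, q_mix_small, q_mix_large)
```

In this toy setting the two mixers rank joint actions identically but report joint values on different scales, which is the kind of degree of freedom a linear sum does not have. Whether and how this freedom produces the amplification reported in the paper is the subject of the paper's own analysis, which this sketch does not reproduce.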

Paper Structure

This paper contains 23 sections, 15 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Revisiting offline RL insights in MARL. (Left) The convex hull denotes the dataset action support, and dots represent actions sampled from the learned policy. BRAC exhibits mode-seeking behavior that extends beyond the dataset support, while AWR remains mode-covering and strictly in-distribution. (Right) Although such mode-seeking would be helpful in single-agent RL, even small out-of-distribution actions induced by BRAC lead to severe performance degradation in MARL, highlighting the sensitivity of joint behavior to individual policy deviations. These results are based on TD learning and hold regardless of the value decomposition method (centralized, VDN, or decentralized).
  • Figure 2: Two-step matrix game with an offline dataset. (Left) The schematic of the didactic example. In $s_1$, Agent A selects between a safe state $s_{2-1}$ with a fixed suboptimal reward of $7$ and a risky state $s_{2-2}$ with an optimal reward of $8$. (Right) The learned joint Q value matrices for each state. The top and bottom rows display the linear method (VDN) and the Mixer, respectively. Each cell reports the mean Q value and two standard deviations across five random seeds.
  • Figure 3: Divergent dynamics of mixer-based critics. Comparison among the monotonic Mixer, VDN, and individual critics under expert offline data. The mixer induces co-amplification of the Q value (Left) and critic loss (Right), indicating a structural instability of the TD operator.
  • Figure 4: Actor loss miscalibration under value-scale amplification. (Left) Actor loss increases sharply as value-scale drift begins, indicating that the policy objective is dominated by value amplitude rather than advantage structure. (Right) The total gradient norm. This reveals ill-conditioned updates and confirms that the coupled actor and mixer-critic system loses numerical stability.
  • Figure 5: Effect of the simple actor-side remedy. These two signals together demonstrate suppression of value-scale amplification without modifying the TD objective.
  • ...and 5 more figures