Foundations of Multivariate Distributional Reinforcement Learning
Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Mark Rowland
TL;DR
This paper extends distributional reinforcement learning to multivariate reward signals by proving contraction and convergence for multivariate distributional dynamic programming (DP) and temporal-difference (TD) methods under an MMD-based metric with semimetric kernels. It introduces three complementary, tractable approaches: (i) a randomized particle DP that handles representation errors with high-probability guarantees, (ii) a categorical DP with state-dependent finite supports that yields a unique fixed point and provable approximation bounds, and (iii) a signed-measure TD approach enabling affine projections and convergent TD learning. Together, these results provide a rigorous foundation for learning multivariate return distributions, reveal how fidelity scales with reward dimension and number of atoms, and demonstrate practical viability via simulations and neural-architecture extensions. The framework supports zero-shot evaluation and risk-sensitive policy design in multivariate settings, and can be scaled to high-dimensional domains with neural approximations.
Abstract
In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than $1$, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-$1$ signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.
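To make the objects in this summary concrete, the following minimal sketch computes a squared maximum mean discrepancy (MMD) between two finitely supported measures on $\mathbb{R}^d$, where the weights may be negative as long as each measure has total mass $1$. It is an illustration under stated assumptions, not the paper's implementation: the semimetric $\rho(x, y) = \|x - y\|_2^\alpha$, the choice of the origin as base point for the induced kernel, and the function names `rho`, `kernel`, and `mmd_squared` are all chosen here for the example.

```python
import math


def rho(a, b, alpha=1.0):
    """Assumed semimetric rho(a, b) = ||a - b||_2 ** alpha on R^d."""
    return math.dist(a, b) ** alpha


def kernel(a, b, alpha=1.0):
    """Kernel induced by rho, using the origin as base point (an assumption)."""
    origin = [0.0] * len(a)
    return 0.5 * (rho(a, origin, alpha) + rho(b, origin, alpha) - rho(a, b, alpha))


def mmd_squared(atoms_p, weights_p, atoms_q, weights_q, alpha=1.0):
    """Squared MMD between two weighted atom sets.

    Weights may be signed; the quantity is a quadratic form in the
    difference of the two measures.
    """
    def cross(xs, ws, ys, vs):
        return sum(
            w * v * kernel(x, y, alpha)
            for x, w in zip(xs, ws)
            for y, v in zip(ys, vs)
        )

    return (
        cross(atoms_p, weights_p, atoms_p, weights_p)
        - 2.0 * cross(atoms_p, weights_p, atoms_q, weights_q)
        + cross(atoms_q, weights_q, atoms_q, weights_q)
    )


# Two-dimensional example: Dirac masses at (0, 0) and (3, 4).
dirac_gap = mmd_squared([(0.0, 0.0)], [1.0], [(3.0, 4.0)], [1.0])

# A signed mass-1 measure compared with a probability measure on shared atoms.
atoms = [(0.0, 0.0), (1.0, 0.0)]
signed_gap = mmd_squared(atoms, [1.5, -0.5], atoms, [1.0, 0.0])
```

With $\alpha = 1$ the first example reduces to the Euclidean distance between the two atoms, and the second shows that the same formula stays well defined when one set of weights contains a negative entry.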
