Foundations of Multivariate Distributional Reinforcement Learning

Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Mark Rowland

TL;DR

This paper extends distributional reinforcement learning to multivariate reward signals by proving contraction and convergence for multivariate distributional DP and TD methods under an MMD-based metric with semimetric kernels. It introduces three complementary, tractable approaches: (i) a randomized particle DP that handles representation errors with high-probability guarantees, (ii) a categorical DP with state-dependent finite supports that yields a unique fixed point and provable approximation bounds, and (iii) a signed-measure TD approach enabling affine projections and convergent temporal-difference learning. Together, these results provide a rigorous foundation for learning multivariate return distributions, reveal how fidelity scales with reward dimension and number of atoms, and demonstrate practical viability via simulations and neural-architecture extensions. The framework supports zero-shot evaluation and risk-sensitive policy design in multivariate settings, with clear avenues for scaling to complex, high-dimensional domains using neural approximations and advanced representations.

Abstract

In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally-tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than $1$, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-$1$ signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.
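The "projection onto the space of mass-$1$ signed measures" mentioned in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Euclidean semimetric, the distance-induced kernel with base point at the origin, the four-atom support in $\mathbf{R}^2$, the reading of the projection as an MMD minimization over signed weights that sum to one, and the hypothetical helper names `kernel`, `solve`, and `project_signed`.

```python
import math

Y0 = (0.0, 0.0)  # assumed base point of the kernel (fixes dimension d = 2)

def kernel(a, b):
    # Distance-induced kernel built from the Euclidean distance (illustrative choice).
    return 0.5 * (math.dist(a, Y0) + math.dist(b, Y0) - math.dist(a, b))

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def project_signed(support, particles, weights):
    """Signed weights p on `support`, summing to 1, that minimize the squared MMD
    to the weighted particle measure. Entries of p may be negative."""
    m = len(support)
    K = [[kernel(zi, zj) for zj in support] for zi in support]
    b = [sum(w * kernel(z, y) for y, w in zip(particles, weights)) for z in support]
    # KKT system of: minimize p^T K p - 2 b^T p subject to sum(p) = 1.
    A = [K[i] + [1.0] for i in range(m)] + [[1.0] * m + [0.0]]
    return solve(A, b + [1.0])[:m]

corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
on_support = project_signed(corners, [(1.0, 0.0)], [1.0])
centre = project_signed(corners, [(0.5, 0.5)], [1.0])
```

Because only the total mass is constrained, and not the sign of each weight, the minimizer is the solution of a single linear system. Under the assumptions above, this is one way to see why such a projection is affine.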

Paper Structure

This paper contains 22 sections, 38 theorems, 108 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let $\rho$ be a semimetric on a space $\mathcal{Y}$ that has strong negative type, in the sense that $\int\rho\,\mathrm{d}([p-q]\times[p-q])<0$ whenever $p\neq q$ are probability measures on a compact set $\mathcal{Y}$. Moreover, let $\kappa:\mathcal{Y}\times\mathcal{Y}\to\mathbf{R}$ [...] for some $y_0\in\mathcal{Y}$. Then $\kappa$ is characteristic, so $\mathrm{MMD}_{\kappa}$ is a metric [...]
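The statement above is cut off before the definition of $\kappa$. As a hedged illustration, the sketch below uses the standard distance-induced kernel $\kappa(y_1,y_2)=\tfrac{1}{2}\big(\rho(y_1,y_0)+\rho(y_2,y_0)-\rho(y_1,y_2)\big)$ from Sejdinovic et al. (2013), on the assumption that this is the kernel the theorem refers to. The Euclidean choice of $\rho$, the example particles, and the function names are likewise assumptions made for illustration.

```python
import math

def rho(a, b):
    """Euclidean distance, a semimetric of strong negative type on R^d."""
    return math.dist(a, b)

def induced_kernel(a, b, y0):
    """Distance-induced kernel with base point y0 (assumed form, see lead-in)."""
    return 0.5 * (rho(a, y0) + rho(b, y0) - rho(a, b))

def mmd_squared(xs, ys, y0):
    """Squared MMD between the equally weighted empirical measures of xs and ys."""
    def mean_kernel(us, vs):
        total = sum(induced_kernel(u, v, y0) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Two small particle sets in R^2 (hypothetical bivariate returns).
p = [(0.0, 0.0), (1.0, 0.0)]
q = [(0.0, 1.0), (1.0, 1.0)]

same = mmd_squared(p, p, (0.0, 0.0))      # identical measures
diff = mmd_squared(p, q, (0.0, 0.0))      # distinct measures
shifted = mmd_squared(p, q, (5.0, -3.0))  # same pair, different base point
```

For probability measures the terms involving $y_0$ cancel, so `diff` and `shifted` agree up to rounding: in this sketch the base point changes the kernel but not the resulting MMD.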

Figures (7)

  • Figure 1: Distributional SMs and associated predicted return distributions with the categorical (left) and EWP (right) representations. Simplex plots denote the distributional SM. Histograms denote the associated return distributions, predicted from a pair of held-out reward functions.
  • Figure 2: Accuracy of zero-shot return distributions over random MDPs, $d=2$, 95% CI.
  • Figure 3: Accuracy of zero-shot return distributions over random MDPs, $d=3$, 95% CI.
  • Figure 4: Distributional SFs and predicted return distributions with $m=400$ atoms, in a random MDP with known rectangular bound on cumulants. Left: Categorical TD. Right: EWP TD.
  • Figure 5: Example state in the parking environment.
  • ...and 2 more figures

Theorems & Definitions (61)

  • Definition 1
  • Theorem 1: Sejdinovic et al. (2013)
  • Remark 1
  • Theorem 2: Convergent MMD dynamic programming for the multi-return distribution function
  • Theorem 3
  • Proposition 1: Convergence of EWP Dynamic Programming
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Corollary 2
  • ...and 51 more