Off-Policy Reinforcement Learning with High Dimensional Reward

Dong Neuck Lee; Michael R. Kosorok

TL;DR

The paper broadens distributional reinforcement learning to rewards valued in infinite-dimensional Banach spaces, proving that the distributional Bellman operator remains a contraction under the 1-Wasserstein metric and that high-dimensional returns can be faithfully approximated in finite-dimensional Euclidean spaces. It introduces two key components: a theoretical approximation framework that projects Banach-space-valued returns onto finite dimensions with controllable error, and a practical algorithm that uses hypercube-based discretization and the distributional Bellman operator to learn the distribution of returns and optimize a user-defined utility $\phi$. Theoretical contributions include contraction results, Banach-space approximation theorems, and stabilized convergence under Wasserstein metrics, while simulations demonstrate accurate distribution estimation and effective policy search under both known and unknown dynamics. The work enables solving complex, multi-objective, and potentially infinite-dimensional reward problems, with practical implications for domains requiring flexible utilities and high-/infinite-dimensional reward representations.
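The hypercube-based discretization and the utility $\phi$ mentioned above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the (already finite-dimensional) returns lie in a bounded box $[\text{low}, \text{high}]^d$, uses equal-width cells with cell centers as representatives, and all function names (`discretize`, `cell_center`, `expected_utility`) are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

Cell = Tuple[int, ...]


def discretize(samples: Sequence[Sequence[float]], low: float, high: float,
               bins: int) -> Dict[Cell, float]:
    """Assign each d-dimensional sample to a hypercube cell; return cell probabilities.

    Assumption (not from the paper): every coordinate lies in [low, high] and
    each axis is split into `bins` equal-width intervals.
    """
    width = (high - low) / bins
    counts: Counter = Counter()
    for x in samples:
        # Clamp the index so points on the upper boundary land in the last cell.
        cell = tuple(min(bins - 1, max(0, int((xi - low) / width))) for xi in x)
        counts[cell] += 1
    n = len(samples)
    return {cell: c / n for cell, c in counts.items()}


def cell_center(cell: Cell, low: float, high: float, bins: int) -> Tuple[float, ...]:
    """Return the center of a hypercube cell, used as its representative point."""
    width = (high - low) / bins
    return tuple(low + (i + 0.5) * width for i in cell)


def expected_utility(dist: Dict[Cell, float],
                     phi: Callable[[Tuple[float, ...]], float],
                     low: float, high: float, bins: int) -> float:
    """Approximate E[phi(Z)] by evaluating phi at the cell centers."""
    return sum(p * phi(cell_center(cell, low, high, bins))
               for cell, p in dist.items())


# Toy two-dimensional returns; phi = min rewards the worst coordinate,
# a utility that is not a function of the mean return alone.
samples = [(0.1, 0.9), (0.1, 0.8), (0.7, 0.2), (0.95, 1.0)]
dist = discretize(samples, low=0.0, high=1.0, bins=2)
utility = expected_utility(dist, min, low=0.0, high=1.0, bins=2)
```

The number of cells grows as `bins ** d`, so a sketch like this is only workable after the return has been projected to a low dimension, which is the role the approximation theory plays in the summary above.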

Abstract

Conventional off-policy reinforcement learning (RL) focuses on maximizing the expected return of scalar rewards. Distributional RL (DRL), in contrast, studies the distribution of returns with the distributional Bellman operator in a Euclidean space, leading to highly flexible choices for utility. This paper establishes robust theoretical foundations for DRL. We prove the contraction property of the Bellman operator even when the reward space is an infinite-dimensional separable Banach space. Furthermore, we demonstrate that the behavior of high- or infinite-dimensional returns can be effectively approximated using a lower-dimensional Euclidean space. Leveraging these theoretical insights, we propose a novel DRL algorithm that tackles problems which have been previously intractable using conventional reinforcement learning approaches.
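For orientation, the contraction property can be written in the form familiar from the Euclidean setting. This is a sketch under stated assumptions, not the paper's exact theorem: the fixed policy $\pi$, the discount factor $\gamma \in (0,1)$, and the supremum over state-action pairs $(s,a)$ are carried over from the standard distributional RL formulation, and $\|\cdot\|$ denotes the norm of the Banach space $\mathbb{B}$.

```latex
% 1-Wasserstein distance between laws \mu, \nu on \mathbb{B},
% where \Gamma(\mu,\nu) is the set of couplings of \mu and \nu:
W_1(\mu, \nu) \;=\; \inf_{(X,Y) \in \Gamma(\mu,\nu)} \mathbb{E}\,\lVert X - Y \rVert

% Contraction of the distributional Bellman operator \mathcal{T}^{\pi}
% in the supremum 1-Wasserstein metric (assumed form):
\sup_{s,a} W_1\bigl( (\mathcal{T}^{\pi} Z_1)(s,a),\, (\mathcal{T}^{\pi} Z_2)(s,a) \bigr)
\;\le\; \gamma \, \sup_{s,a} W_1\bigl( Z_1(s,a),\, Z_2(s,a) \bigr)
```

Under an inequality of this form, repeated application of $\mathcal{T}^{\pi}$ shrinks the distance between any two return distributions geometrically at rate $\gamma$, which is what makes iterative estimation of the return distribution well posed.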

Paper Structure (16 sections, 10 theorems, 73 equations, 12 figures, 2 algorithms)

Introduction
Setting
Method
Approximation of value distribution
Algorithm based on distributional Bellman operator
Theory
Contraction Property of the Distributional Bellman Operator in Banach Space
Hypercube Approximation Theory
Approximation Theory in Banach Space
Convergence and Wasserstein Metric in Banach Spaces
Alternative Wasserstein Distance
Simulation
Scenario 1
Scenario 2
Scenario 3
...and 1 more section

Key Result

Theorem 3.1

For any fixed $\epsilon>0$ specified in Assumption assumption_m1, and any closed $A_{\epsilon}\subset\mathbb{B}$ as defined above, the projection $\Tilde{Z}_\epsilon$ satisfies

Figures (12)

Figure 1: Group-level differences in continuous 3D brain activity between antipsychotic-treated and placebo groups in first-episode psychosis patients.
Figure 2: Distance Path in Scenario 1.
Figure 3: Policy 1. Empirical vs. Estimated Distribution by Algorithm 1 in Scenario 1
Figure 4: Policy 2. Empirical vs. Estimated Distribution by Algorithm 1 in Scenario 1
Figure 5: Policy 3. Empirical vs. Estimated Distribution by Algorithm 1 in Scenario 1
...and 7 more figures

Theorems & Definitions (21)

Theorem 3.1
Theorem 3.2
Remark 4.1
Theorem 4.1: K-R Theorem
Lemma 4.2
proof
Theorem 4.3
proof
Lemma 4.4
proof
...and 11 more
