Multi-objective Reinforcement Learning with Nonlinear Preferences: Provable Approximation for Maximizing Expected Scalarized Return

Nianli Peng, Muhang Tian, Brandon Fain

TL;DR

The paper studies how to optimize a nonlinear welfare function over multi-objective trajectories, taking expected scalarized return (ESR) as the objective and deriving an extended Bellman form that conditions on accumulated reward and remaining horizon. It proposes Reward-Aware Value Iteration (RAVI), which computes near-optimal non-stationary policies in pseudopolynomial time for smooth scalarizations with a fixed number of objectives, and extends the approach to a model-learning setting with RAEE. Theoretical guarantees bound the approximation error and runtime, and experiments on Taxi and Scavenger show substantial improvements over baselines for several nonlinear welfare functions. The work gives provable guarantees for ESR in multi-objective reinforcement learning (MORL) and a practical way to handle complex preferences in finite-horizon MOMDPs, with potential extensions to stochastic rewards and deeper function approximation. It also supports integrating fairness and risk preferences into multi-objective RL for multi-agent and shared-resource settings.

Abstract

We study multi-objective reinforcement learning with nonlinear preferences over trajectories. That is, we maximize the expected value of a nonlinear function over accumulated rewards (expected scalarized return or ESR) in a multi-objective Markov Decision Process (MOMDP). We derive an extended form of Bellman optimality for nonlinear optimization that explicitly considers time and current accumulated reward. Using this formulation, we describe an approximation algorithm for computing an approximately optimal non-stationary policy in pseudopolynomial time for smooth scalarization functions with a constant number of rewards. We prove the approximation analytically and demonstrate the algorithm experimentally, showing that there can be a substantial gap between the optimal policy computed by our algorithm and alternative baselines.
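To make the ESR objective concrete, here is a minimal sketch that contrasts the expected welfare of the accumulated reward (ESR) with the welfare of the expected accumulated reward (often called scalarized expected return). The two-outcome lottery and the geometric-mean form of the Nash welfare are assumptions chosen for illustration, not values or definitions taken from the paper.

```python
import math


def nash_welfare(acc):
    # Assumed form for illustration: geometric mean of the accumulated rewards.
    return math.prod(acc) ** (1.0 / len(acc))


# Hypothetical lottery over accumulated reward vectors: (probability, reward vector).
outcomes = [(0.5, (4.0, 0.0)), (0.5, (0.0, 4.0))]

# ESR: apply the nonlinear welfare to each trajectory's return, then average.
esr = sum(p * nash_welfare(r) for p, r in outcomes)

# Alternative: average the returns first, then apply the welfare function.
mean_return = tuple(sum(p * r[i] for p, r in outcomes) for i in range(2))
ser = nash_welfare(mean_return)

print(esr, ser)
```

In this toy case every individual trajectory leaves one objective at zero, so the ESR is zero even though the averaged return looks balanced. That is the sense in which a nonlinear preference over trajectories cannot be reduced to a preference over expected returns.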

Paper Structure

This paper contains 35 sections, 10 theorems, 27 equations, 10 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

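Let $\mathcal{V}(s, \mathbf{R}(\tau), 0) = W(\mathbf{R}(\tau))$ for all states $s$ and trajectories $\tau$. For every state $s$, history $\tau$, and $t > 0$ time steps remaining, let $\mathcal{V}(s, \mathbf{R}(\tau), t)$ be defined by the extended Bellman recursion over accumulated reward and remaining time. Then $V^*(s, \tau, t) = \mathcal{V}(s, \mathbf{R}(\tau), t)$.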
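The idea behind the lemma, that the value depends on the history only through the accumulated reward and the time remaining, can be sketched as a small dynamic program. This is an illustrative sketch under stated assumptions, not the paper's RAVI algorithm: the two-state MOMDP, the egalitarian (minimum) welfare, integer rewards, and the absence of discounting and of reward discretization are all hypothetical simplifications.

```python
from functools import lru_cache

# Hypothetical 2-state, 2-action MOMDP with two objectives (illustrative only).
# TRANSITIONS[(s, a)] is a list of (probability, next_state) pairs.
TRANSITIONS = {
    (0, 0): [(1.0, 0)],
    (0, 1): [(0.5, 0), (0.5, 1)],
    (1, 0): [(1.0, 1)],
    (1, 1): [(1.0, 1)],
}
# REWARDS[(s, a)] is the reward vector earned by taking action a in state s.
REWARDS = {(0, 0): (1, 0), (0, 1): (0, 1), (1, 0): (0, 0), (1, 1): (0, 0)}
ACTIONS = (0, 1)


def welfare(acc):
    # Assumed welfare function W: the minimum over objectives.
    return float(min(acc))


def q_value(s, acc, t, a):
    # Expected value of taking action a with t > 0 steps remaining.
    new_acc = tuple(x + r for x, r in zip(acc, REWARDS[(s, a)]))
    return sum(p * value(s2, new_acc, t - 1) for p, s2 in TRANSITIONS[(s, a)])


@lru_cache(maxsize=None)
def value(s, acc, t):
    # Base case: with no steps left, the value is the welfare of what was accumulated.
    if t == 0:
        return welfare(acc)
    return max(q_value(s, acc, t, a) for a in ACTIONS)


def best_action(s, acc, t):
    # Greedy action for the augmented state (s, acc, t); ties go to the first action.
    return max(ACTIONS, key=lambda a: q_value(s, acc, t, a))


print(value(0, (0, 0), 2))
print(best_action(0, (1, 0), 1), best_action(0, (0, 1), 1))
```

In this sketch the greedy action in state 0 with one step remaining differs depending on whether the accumulated reward is `(1, 0)` or `(0, 1)`, which illustrates why the resulting policy is non-stationary and must track accumulated reward rather than the environment state alone.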

Figures (10)

  • Figure 1: Taxi Optimization Example
  • Figure 2: Visualization of the Taxi and Scavenger environments.
  • Figure 3: Comparisons with baselines, with learning curves included.
  • Figure 4: Ablation study on discretization factor $\alpha$.
  • Figure 5: Taxi, $W_{\text{Nash}}$
  • ...and 5 more figures

Theorems & Definitions (19)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Lemma 1
  • Definition 6: Recursive Formulation of $V^*$
  • Definition 7
  • Lemma 2: Uniform continuity of multi-objective value function
  • Lemma 3: Approximation Error of RAVI
  • ...and 9 more