Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning

Swetha Ganesh; Vaneet Aggarwal

TL;DR

We develop a Natural Policy Gradient algorithm equipped with a multi-level Monte Carlo estimator that controls the bias of the scalarization gradient at low sampling cost. This yields the first optimal $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.

Abstract

While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility $f(J_1^\pi,\dots,J_M^\pi)$ over multiple objectives, where each $J_m^\pi$ denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on $\partial f(J^\pi)$, while in practice only empirical return estimates $\hat J$ are available. Because $f$ is nonlinear, the plug-in estimator is biased ($\mathbb{E}[\partial f(\hat J)] \neq \partial f(\mathbb{E}[\hat J])$), leading to persistent gradient bias that degrades sample complexity. In this work, we identify and overcome this bias barrier in concave-scalarized multi-objective reinforcement learning. We show that existing policy-gradient methods suffer an intrinsic $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity due to this bias. To address this issue, we develop a Natural Policy Gradient (NPG) algorithm equipped with a multi-level Monte Carlo (MLMC) estimator that controls the bias of the scalarization gradient while maintaining low sampling cost. We prove that this approach achieves the optimal $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for computing an $\epsilon$-optimal policy. Furthermore, we show that when the scalarization function is second-order smooth, the first-order bias cancels automatically, allowing vanilla NPG to achieve the same $\widetilde{\mathcal{O}}(\epsilon^{-2})$ rate without MLMC. Our results provide the first optimal sample complexity guarantees for concave multi-objective reinforcement learning under policy-gradient methods.
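To make the bias barrier concrete, the following Python sketch compares the plug-in gradient with a randomized MLMC correction on a toy problem. Everything in it is an illustrative assumption rather than the paper's algorithm: the utility $f(J)=\sum_m \log J_m$, the uniform noise model in `sample_return`, and the helper names `plugin_grad`, `mlmc_grad`, and `mean_error`. The MLMC construction shown is one common randomized-level form (draw a level $Q$ with $P(Q=q)=2^{-q}$ and reweight the difference between fine and coarse plug-in estimates by $2^Q$); the estimator analyzed in the paper may differ in its details.

```python
import random


def grad_f(j):
    """Gradient of the assumed utility f(J) = sum_m log(J_m)."""
    return [1.0 / x for x in j]


def sample_return(rng, true_j):
    """One unbiased noisy return sample per objective (toy noise model)."""
    return [mu + rng.uniform(-0.5, 0.5) for mu in true_j]


def mean_of(samples):
    """Coordinate-wise average of a list of return vectors."""
    n = len(samples)
    return [sum(s[m] for s in samples) / n for m in range(len(samples[0]))]


def plugin_grad(rng, true_j, n):
    """Plug-in estimate: evaluate grad_f at the mean of n return samples."""
    return grad_f(mean_of([sample_return(rng, true_j) for _ in range(n)]))


def mlmc_grad(rng, true_j, max_level):
    """Randomized MLMC estimate (hypothetical sketch, not the paper's code)."""
    # Draw the level Q with P(Q = q) = 2**-q for q = 1, 2, ...
    q = 1
    while rng.random() < 0.5:
        q += 1
    # Levels beyond the cap fall back to the single-sample plug-in estimate.
    if q > max_level:
        return plugin_grad(rng, true_j, 1)
    samples = [sample_return(rng, true_j) for _ in range(2 ** q)]
    g0 = grad_f(mean_of(samples[:1]))                    # 1 sample
    g_fine = grad_f(mean_of(samples))                    # 2**q samples
    g_coarse = grad_f(mean_of(samples[: 2 ** (q - 1)]))  # 2**(q-1) samples
    # Reweighting by 2**q makes the level differences telescope in expectation.
    return [a + (2 ** q) * (b - c) for a, b, c in zip(g0, g_fine, g_coarse)]


def mean_error(estimator, true_j, trials, seed=0):
    """Monte Carlo estimate of |E[estimate] - grad_f(true_j)| for objective 0."""
    rng = random.Random(seed)
    total = sum(estimator(rng, true_j)[0] for _ in range(trials))
    return abs(total / trials - grad_f(true_j)[0])


if __name__ == "__main__":
    true_j = [1.0, 2.0]
    print("plug-in, n=1:", mean_error(lambda r, j: plugin_grad(r, j, 1), true_j, 40000))
    print("MLMC, cap=6 :", mean_error(lambda r, j: mlmc_grad(r, j, 6), true_j, 40000))
```

Under these assumptions, the MLMC estimate has the same expectation as a plug-in estimate built from `2 ** max_level` samples, while drawing only about `max_level` samples on average; the price is a higher variance per draw.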

Paper Structure

This paper contains 41 sections, 12 theorems, 149 equations, 1 table, and 2 algorithms.

Introduction
Technical overview and novelty.
Main Contributions
Related Works
Concave multi-objective reinforcement learning.
Policy gradient methods for concave utilities.
Reinforcement learning with general utilities.
Problem Setting
Algorithm
Estimating the Policy Gradient
Empirical Return Estimator (Vanilla NPG)
MLMC Return Estimator
Natural Policy Gradient Estimation and Policy Update
Main Results
Assumptions
...and 26 more sections

Key Result

Theorem 1

Let Assumptions assump:concave--assump:trans-comp-error hold. Consider Algorithm alg:MLMC-NPG with and $B=1$. Then

Theorems & Definitions (14)

Theorem 1: MLMC-NPG
Theorem 2: Vanilla NPG under Second-Order Smoothness
Lemma 1: General Framework
Lemma 2: Stationary Convergence
Lemma 3: NPG Estimation Errors
Lemma 4
Lemma 5
proof
Lemma 6
proof
...and 4 more
