An Online Multiobjective Policy Gradient for Long-run Average-reward Markov Decision Process

Rahul Misra, Manuela L. Bujorianu, Rafał Wisniewski

TL;DR

This work tackles multi-objective reinforcement learning for long-run average rewards in the presence of adversarial disturbances by steering the time-averaged reward vector $\bar{\mathbf{r}}$ to a target set $T$ using Blackwell's approachability theorem. It introduces an online two-timescale method: an inner policy-gradient/actor-critic loop optimizes a scalarized objective $\langle \bar{\mathbf{r}}(\pi), \lambda \rangle$ with function approximation, and an outer loop updates the scalarization vector $\lambda$ by projecting $\bar{\mathbf{r}}$ onto $T$, thereby enforcing the Blackwell condition. Convergence to $T$ is established theoretically under ergodicity via Kac's theorem and the Green/Poisson framework, and a toy numerical example shows convergence despite worst-case disturbances. The contribution to multi-objective RL is an online, model-free approach with function approximation and asymptotic guarantees; the authors suggest extending it to more players and improving sample efficiency.

Abstract

We propose a reinforcement learning (RL) framework for multi-objective decision-making, where the agent seeks to optimize a vector of rewards rather than a single scalar value. The objective is to ensure that the time-averaged reward vector converges asymptotically to a predefined target set. Since standard RL algorithms operate on scalar rewards, we introduce a dynamic scalarization mechanism guided by Blackwell's Approachability Theorem. This theorem enables adaptive updates of the scalarization vector to guarantee convergence toward the target set. Assuming ergodicity, the Markov chain induced by the learned policies admits a stationary distribution, ensuring all states recur with finite return times. Our algorithm exploits this property by defining an inner loop that applies a policy gradient method (with baseline) between successive visits to a designated recurrent state, enforcing Blackwell's condition at each iteration. An outer loop then updates the scalarization vector after each recurrence. We establish theoretical convergence of the long-run average reward vector to the target set and validate the approach through a numerical example.
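The outer-loop idea described above (project the running average reward onto the target set, then use the direction toward that projection as the scalarization vector) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the target set is an axis-aligned box, there is no adversary, the inner policy-gradient loop is replaced by a greedy choice from a hypothetical finite menu of reward vectors (`REWARD_MENU`), and the sign convention (maximize the inner product with the direction toward the projection) is one common way to state Blackwell's condition.

```python
import math

# Hypothetical menu of achievable reward vectors (stand-in for the inner loop).
REWARD_MENU = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
# Hypothetical box-shaped target set T = [0.4, 0.6] x [0.4, 0.6].
LOWER, UPPER = [0.4, 0.4], [0.6, 0.6]


def project_onto_box(point, lower, upper):
    """Euclidean projection of `point` onto the box [lower, upper]."""
    return [min(max(p, lo), hi) for p, lo, hi in zip(point, lower, upper)]


def scalarization_vector(avg_reward, lower, upper, tol=1e-12):
    """Unit vector from the average reward toward its projection on the box.

    Returns the zero vector when the average reward already lies in the box.
    """
    proj = project_onto_box(avg_reward, lower, upper)
    diff = [q - r for q, r in zip(proj, avg_reward)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm < tol:
        return [0.0] * len(diff)
    return [d / norm for d in diff]


def run_toy(steps):
    """Alternate scalarization updates with a greedy stand-in for the inner loop."""
    avg = [0.0, 0.0]
    for n in range(1, steps + 1):
        lam = scalarization_vector(avg, LOWER, UPPER)
        # Greedy stand-in: pick the reward vector maximizing <r, lambda>.
        reward = max(REWARD_MENU, key=lambda r: sum(a * b for a, b in zip(r, lam)))
        # Incremental update of the time-averaged reward vector.
        avg = [a + (r - a) / n for a, r in zip(avg, reward)]
    proj = project_onto_box(avg, LOWER, UPPER)
    dist = math.sqrt(sum((a - q) ** 2 for a, q in zip(avg, proj)))
    return avg, dist


if __name__ == "__main__":
    final_avg, final_dist = run_toy(2000)
    print("average reward:", final_avg)
    print("distance to target set:", final_dist)
```

In this toy the distance from the running average to the box shrinks as the number of steps grows, which is the qualitative behavior approachability is meant to guarantee. In the paper's setting the greedy choice would be replaced by the policy-gradient inner loop, and the rewards would also depend on the adversary.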

Paper Structure

This paper contains 6 sections, 4 theorems, 29 equations, 2 figures, 1 algorithm.

Key Result

Theorem 1

The gradient of the average reward is given by an expression involving the TD error $\delta(x,u)$, calculated at time $t$. Here $\hat{V}$ and $\hat{g}$ are approximated via gradient descent with a compatible function approximation that minimizes the approximation error (see the paper's value-function approximation and TD update equations).
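For orientation, the standard textbook form of the average-reward policy gradient with a TD-error estimate is sketched below. This is an assumption about the general shape of the result, not a reproduction of the paper's equations: the symbols $\theta$ (policy parameters), $\pi_\theta$, $d^{\pi_\theta}$ (stationary state distribution), and $r_{t+1}$ (here taken to be the scalarized reward) are introduced for illustration, and the paper's notation may differ.

```latex
% Average-reward policy gradient expressed through the TD error (textbook form)
\nabla_\theta \, g(\theta)
  = \mathbb{E}_{x \sim d^{\pi_\theta},\; u \sim \pi_\theta(\cdot \mid x)}
    \left[ \delta(x,u) \, \nabla_\theta \log \pi_\theta(u \mid x) \right]

% Differential TD error at time t, with estimated average reward \hat{g}
% and estimated differential value function \hat{V}
\delta_t = r_{t+1} - \hat{g} + \hat{V}(x_{t+1}) - \hat{V}(x_t)
```

In this form $\hat{V}$ acts as the baseline, and $\delta_t$ serves as a sample-based estimate of the advantage of taking $u_t$ in $x_t$.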

Figures (2)

  • Figure 1: The long-run average reward vector (shown in pink) approaches the target set $T$ despite the adversary choosing worst-case points for the policies ($u^1_0$, $u^1_1$, $u^1_2$) selected by Player $1$.
  • Figure :

Theorems & Definitions (8)

  • Definition II.1
  • Remark 1
  • Remark 2
  • Theorem 1: Policy Gradient Theorem [sutton1999policy, sutton2018reinforcement]
  • Theorem 2: Kac's Theorem [meyn2012markov]
  • Theorem 3: Approachability Theorem for Markov Games [shimkin1993guaranteed]
  • Proposition 4
  • Proof: Sketch of proof
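Kac's theorem (Theorem 2 in the list above) is what makes it reasonable to run the inner loop between successive visits to a designated recurrent state: in its finite-state form, for an ergodic chain with stationary distribution $\pi$, the expected return time to a state $x$ equals $1/\pi(x)$ and is therefore finite. The following minimal sketch checks this identity numerically. The 3-state transition matrix `P` is a hypothetical example chosen for illustration and does not come from the paper.

```python
# Hypothetical ergodic 3-state transition matrix (rows sum to 1).
P = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.4, 0.5],
]


def stationary_distribution(P, iters=5000):
    """Approximate the stationary distribution by power iteration (pi <- pi P)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def expected_return_time(P, state, iters=5000):
    """Expected number of steps to return to `state`, via first-step analysis.

    h[j] is the expected hitting time of `state` from j; it is found by
    fixed-point iteration of h[j] = 1 + sum_{k != state} P[j][k] * h[k].
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(iters):
        h = [
            0.0 if j == state
            else 1.0 + sum(P[j][k] * h[k] for k in range(n) if k != state)
            for j in range(n)
        ]
    return 1.0 + sum(P[state][k] * h[k] for k in range(n) if k != state)


if __name__ == "__main__":
    pi = stationary_distribution(P)
    for x in range(len(P)):
        print(x, expected_return_time(P, x), 1.0 / pi[x])
```

The two printed columns should agree for every state up to numerical tolerance, which is the content of Kac's identity for this toy chain.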