Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning

Shuang Qiu; Dake Zhang; Rui Yang; Boxiang Lyu; Tong Zhang

TL;DR

This work addresses MORL by systematically evaluating optimization targets and identifying Tchebycheff scalarization as a favorable target for traversing all Pareto optimal policies under learner preferences. It reformulates the non-smooth Tchebycheff objective into a min-max-max problem and presents online UCB-based and preference-free algorithms (TchRL and PF-TchRL) with provable $\tilde{O}(\varepsilon^{-2})$ sample complexity per preference. It further extends to a smooth Tchebycheff scalarization (STCH) with STchRL and PF-STchRL, offering improved discrimination between Pareto optimal and weakly Pareto optimal policies and faster weight-update learning rates under certain regimes. Theoretical analyses rely on optimism, concentration bounds, and occupancy-measure arguments, and the results generalize to broader multi-objective stochastic optimization settings. Overall, the paper provides a principled, provably efficient framework to learn and controllably traverse the Pareto front in MORL, with practical implications for environments where preferences over objectives vary or are learned.

Abstract

This paper investigates multi-objective reinforcement learning (MORL), which focuses on learning Pareto optimal policies in the presence of multiple reward functions. Despite MORL's significant empirical success, there is still a lack of satisfactory understanding of various MORL optimization targets and efficient learning algorithms. Our work offers a systematic analysis of several optimization targets to assess their abilities to find all Pareto optimal policies and controllability over learned policies by the preferences for different objectives. We then identify Tchebycheff scalarization as a favorable scalarization method for MORL. Considering the non-smoothness of Tchebycheff scalarization, we reformulate its minimization problem into a new min-max-max optimization problem. Then, for the stochastic policy class, we propose efficient algorithms using this reformulation to learn Pareto optimal policies. We first propose an online UCB-based algorithm to achieve an $\varepsilon$ learning error with an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ sample complexity for a single given preference. To further reduce the cost of environment exploration under different preferences, we propose a preference-free framework that first explores the environment without pre-defined preferences and then generates solutions for any number of preferences. We prove that it only requires an $\tilde{\mathcal{O}}(\varepsilon^{-2})$ exploration complexity in the exploration phase and demands no additional exploration afterward. Lastly, we analyze the smooth Tchebycheff scalarization, an extension of Tchebycheff scalarization, which is proved to be more advantageous in distinguishing the Pareto optimal policies from other weakly Pareto optimal policies based on entry values of preference vectors. Furthermore, we extend our algorithms and theoretical analysis to accommodate this optimization target.
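The reformulation mentioned in the abstract can be motivated by a short derivation. The block below is a sketch under stated assumptions, not necessarily the exact form used in the paper: it assumes the Tchebycheff objective takes the form suggested by the Figure 1 caption, with nonnegative preference weights $\lambda_i$, a small constant $\iota > 0$, and $V_{i,1}^{*}(s_1) = \max_{\nu} V_{i,1}^{\nu}(s_1)$. The symbols $\Delta_m$ (the probability simplex over the $m$ objectives), $w$, and $\nu_i$ are notation introduced here for illustration.

```latex
% Assumed form of the Tchebycheff objective for a preference vector \lambda \ge 0:
\mathrm{TCH}_{\lambda}(\pi)
  = \max_{i\in[m]} \lambda_i \bigl( V_{i,1}^{*}(s_1) + \iota - V_{i,1}^{\pi}(s_1) \bigr).

% Step 1: a maximum over coordinates equals a maximum of a linear function over the simplex.
\min_{\pi} \mathrm{TCH}_{\lambda}(\pi)
  = \min_{\pi} \max_{w\in\Delta_m} \sum_{i=1}^{m} w_i \lambda_i
    \bigl( V_{i,1}^{*}(s_1) + \iota - V_{i,1}^{\pi}(s_1) \bigr).

% Step 2: since w_i \lambda_i \ge 0, each V_{i,1}^{*}(s_1) = \max_{\nu_i} V_{i,1}^{\nu_i}(s_1)
% can be pulled out as a separate maximization, giving a min-max-max problem.
\min_{\pi} \mathrm{TCH}_{\lambda}(\pi)
  = \min_{\pi} \max_{w\in\Delta_m} \max_{\nu_1,\dots,\nu_m} \sum_{i=1}^{m} w_i \lambda_i
    \bigl( V_{i,1}^{\nu_i}(s_1) + \iota - V_{i,1}^{\pi}(s_1) \bigr).
```

Written this way, the non-smooth outer maximum over objectives becomes a maximization of a function that is linear in $w$, which is the kind of structure that weight-update schemes can exploit.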

Paper Structure

This paper contains 39 sections, 38 theorems, 295 equations, 3 figures, 1 table, and 6 algorithms.

Introduction
Problem Formulation
Learning Goal of Multi-Objective RL
Optimization Targets for Multi-Objective RL
MORL via Tchebycheff Scalarization
Preference-Free MORL via Tchebycheff Scalarization
Extension to Smooth Tchebycheff Scalarization
Theoretical Analysis
Proof Sketch of Theorem \ref{thm:tch}
Proof Sketch of Theorem \ref{thm:pre-free}
Proof Sketch of Theorem \ref{thm:stch}
Proof Sketch of Theorem \ref{thm:pf-stch}
Conclusion
Proofs for Section \ref{sec:pareto}
Proof of Property \ref{pro:property}
...and 24 more sections

Key Result

Proposition 3.5

If for each $\pi\notin\Pi_{\mathrm{P}}^*$, there always exists a Pareto optimal policy $\pi^*\in\Pi_{\mathrm{P}}^*$ such that $V_{i,1}^{\pi}(s_1) < V_{i,1}^{\pi^*}(s_1)$ for all $i\in [m]$, then we have $\Pi_{\mathrm{W}}^* = \Pi_{\mathrm{P}}^*$.
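As a hedged illustration of Proposition 3.5, the sketch below checks the statement on small finite sets of value vectors, using the standard definitions: a vector is Pareto optimal if no other vector is at least as good in every objective and strictly better in one, and weakly Pareto optimal if no other vector is strictly better in every objective. Treating policies as a finite list of value vectors is an assumption made for illustration (the paper works with the stochastic policy class), and the function names and example numbers are hypothetical.

```python
import numpy as np

def pareto_optimal(values):
    """Indices of rows that no other row dominates
    (dominates: >= in every objective and > in at least one)."""
    values = np.asarray(values, dtype=float)
    keep = []
    for i, v in enumerate(values):
        dominated = any(
            np.all(u >= v) and np.any(u > v)
            for j, u in enumerate(values) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

def weakly_pareto_optimal(values):
    """Indices of rows for which no other row is strictly better in every objective."""
    values = np.asarray(values, dtype=float)
    keep = []
    for i, v in enumerate(values):
        strictly_dominated = any(
            np.all(u > v) for j, u in enumerate(values) if j != i
        )
        if not strictly_dominated:
            keep.append(i)
    return keep

def premise_holds(values):
    """Premise of Proposition 3.5: every non-Pareto row is strictly worse,
    in every objective, than some Pareto optimal row."""
    values = np.asarray(values, dtype=float)
    pareto = pareto_optimal(values)
    others = [i for i in range(len(values)) if i not in pareto]
    return all(
        any(np.all(values[j] > values[i]) for j in pareto)
        for i in others
    )

# Hypothetical value vectors (rows = policies, columns = objectives).
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]]  # premise fails: row 2 ties row 0 in objective 2
B = [[1.0, 0.2], [0.2, 1.0], [0.1, 0.1]]  # premise holds: row 2 is strictly below row 0

for name, vals in (("A", A), ("B", B)):
    print(name, premise_holds(vals), pareto_optimal(vals), weakly_pareto_optimal(vals))
```

In set A the third vector is weakly Pareto optimal but not Pareto optimal, so the two sets differ. In set B the premise holds and the two sets coincide, as the proposition predicts.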

Figures (3)

Figure 1: Example of Scalarization Methods for Stochastic Policy Class. Consider a bi-objective multi-armed bandit problem with stochastic policies, whose reward functions are $r_1(a_1)=0.2, r_1(a_2)=0.8$ and $r_2(a_1)=0.8, r_2(a_2)=0.2$. The red dots in the figures are $(r_1(a_1),r_2(a_1))$ and $(r_1(a_2),r_2(a_2))$. The red lines represent the reward values under all stochastic Pareto optimal policies. We show the level sets of different scalarization functions in terms of $(r_1,r_2)$, i.e., linear scalarization $\sum_{i=1}^2 \lambda_i r_i$, Tchebycheff scalarization $\max_i \lambda_i (0.8+\iota-r_i)$, and smooth Tchebycheff scalarization $\mu\log\sum_{i=1}^2 e^{\frac{\lambda_i (0.8+\iota-r_i)}{\mu}}$. The blue dotted lines are the level sets for the optimal (maximal or minimal) scalarization function values with $(r_1,r_2)$ restricted to the red lines. Linear scalarization therefore does not distinguish among the different Pareto optimal values, whereas (smooth) Tchebycheff scalarization can identify each point by choosing a different $\bm{\lambda}$, showing better solution controllability.
Figure 2: Illustration of Part 2)
Figure 3: Illustration of Part 3)
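The bandit example in the Figure 1 caption can be reproduced numerically. The sketch below uses the rewards and the three scalarization formulas stated in that caption. The values chosen for $\iota$ and $\mu$, the grid search over the mixing probability, and all function names are assumptions made for illustration.

```python
import numpy as np

# Rewards from the Figure 1 caption: rows are arms a_1, a_2; columns are (r_1, r_2).
R = np.array([[0.2, 0.8],
              [0.8, 0.2]])
BEST = R.max(axis=0)   # best achievable value per objective (0.8 for both)
IOTA = 0.01            # small positive offset; value assumed for illustration
MU = 0.05              # smoothing parameter; value assumed for illustration
GRID = np.linspace(0.0, 1.0, 1001)  # candidate probabilities of playing a_1

def values(p):
    """Expected reward vector when a_1 is played with probability p."""
    return p * R[0] + (1.0 - p) * R[1]

def linear(lam, v):
    """Linear scalarization (to be maximized)."""
    return float(np.dot(lam, v))

def tch(lam, v):
    """Tchebycheff scalarization (to be minimized)."""
    return float(np.max(np.asarray(lam) * (BEST + IOTA - v)))

def stch(lam, v, mu=MU):
    """Smooth Tchebycheff scalarization via a numerically stable log-sum-exp."""
    z = np.asarray(lam) * (BEST + IOTA - v) / mu
    zmax = z.max()
    return float(mu * (zmax + np.log(np.exp(z - zmax).sum())))

def scores(score, lam):
    """Scalarized value for every mixing probability on the grid."""
    return np.array([score(lam, values(p)) for p in GRID])

def best_p(score, lam, maximize):
    """Grid point that optimizes the given scalarization."""
    s = scores(score, lam)
    return float(GRID[s.argmax() if maximize else s.argmin()])

for lam in ([0.7, 0.3], [0.3, 0.7]):
    print(lam,
          "linear ->", best_p(linear, lam, maximize=True),
          "TCH ->", round(best_p(tch, lam, maximize=False), 3),
          "STCH ->", round(best_p(stch, lam, maximize=False), 3))
```

Under these assumptions, linear scalarization selects a pure arm for any unequal weights and is indifferent across all mixtures when the weights are equal. The Tchebycheff minimizer instead moves continuously along the Pareto front as $\bm{\lambda}$ changes, which is the controllability the caption describes.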

Theorems & Definitions (85)

Definition 3.1: Pareto Optimal Policy
Definition 3.3: Weakly Pareto Optimal Policy
Example 3.4
Proposition 3.5
Proposition 4.1
Proposition 4.2
Definition 4.3: Pareto Suboptimality Gap
Proposition 4.4: Equivalent Form of PSG
Proposition 4.5
Definition 4.6: Tchebycheff Scalarization
...and 75 more
