C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front

Ruohong Liu, Yuxin Pan, Linjie Xu, Lei Song, Jiang Bian, Pengcheng You, Yize Chen

TL;DR

This work tackles multi-objective reinforcement learning (MORL) in settings where user preferences vary and must be accommodated efficiently. It introduces Constrained MORL (C-MORL), a two-stage Pareto-front discovery framework that recasts MORL as constrained policy optimization over a constrained Markov decision process (CMDP). Policies are selected by crowd distance and the front is extended from them with constrained updates; a Policy Assignment mechanism then picks a surrogate policy from the resulting set for each unseen preference. The method improves front quality and utility across discrete and continuous tasks, scales to as many as nine objectives, and reports up to 35% higher hypervolume and 9% higher expected utility than state-of-the-art baselines on benchmarks. Because it relies on an interior-point approach whose complexity is linear in the number of objectives, C-MORL offers scalable, practical MORL with broad Pareto-front coverage and immediate adaptation to new preferences.

Abstract

Multi-objective reinforcement learning (MORL) excels at handling rapidly changing preferences in tasks that involve multiple criteria, even for unseen preferences. However, previous dominant MORL methods typically generate a fixed policy set or a preference-conditioned policy through multiple training iterations exclusively for sampled preference vectors, and cannot ensure the efficient discovery of the Pareto front. Furthermore, integrating preferences into the input of policy or value functions presents scalability challenges, in particular as the dimensions of the state and preference spaces grow, which can complicate the learning process and hinder the algorithm's performance on more complex tasks. To address these issues, we propose a two-stage Pareto front discovery algorithm called Constrained MORL (C-MORL), which serves as a seamless bridge between constrained policy optimization and MORL. Concretely, a set of policies is trained in parallel in the initialization stage, with each optimized towards its individual preference over the multiple objectives. Then, to fill the remaining vacancies in the Pareto front, constrained optimization steps are employed to maximize one objective while constraining the other objectives to exceed a predefined threshold. Empirically, compared to recent advancements in MORL methods, our algorithm achieves more consistent and superior performance in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks, especially with numerous objectives (up to nine objectives in our experiments).
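
The extension step described in the abstract, maximizing one objective while keeping the others above a threshold, can be sketched as a constrained problem. The notation here is an assumption chosen for illustration: $G_i^{\pi}$ is the expected return of policy $\pi$ on objective $i$, $l$ is the index of the objective being maximized, $d_i$ is the threshold for objective $i$, and $n$ is the number of objectives.

$$
\max_{\pi} \; G_l^{\pi} \quad \text{s.t.} \quad G_i^{\pi} \geq d_i, \qquad i = 1, \ldots, n, \; i \neq l
$$

Each extension step then amounts to choosing a starting policy, an objective $l$ to push, and thresholds $d_i$ that stop the remaining objectives from degrading too far.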

Paper Structure (29 sections, 5 theorems, 48 equations, 8 figures, 13 tables, 2 algorithms)

Key Result

Proposition 4.3

Let $\tilde{G}_i$ denote the list of $i^{th}$-objective values in $P$ sorted in ascending order, and suppose the position of the initial point $P_r$ in the sorted sequence $\tilde{P}_i$ is $k$. If $d_i \geq \tilde{G}_i(k-1)$ for all $i=1, \ldots, n, \; i\neq l$, then the optimal solution of problem Eq. \ref{equ:CMDP} is a
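
The threshold condition in the proposition can be illustrated with a small sketch. It picks the smallest admissible value, $d_i = \tilde{G}_i(k-1)$, for every objective other than the maximized one. The function name `constraint_thresholds`, the 0-based indexing, and the fallback used when the initial point already has the lowest value on an objective are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Optional, Sequence


def constraint_thresholds(
    front: Sequence[Sequence[float]],
    init_idx: int,
    maximize_obj: int,
) -> List[Optional[float]]:
    """Return one threshold d_i per objective (None for the maximized objective l).

    For each objective i != l, the objective values of the current solution set
    are sorted in ascending order and d_i is set to the value ranked just below
    the initial point, i.e. the (k-1)-th entry of the sorted list.
    """
    n_obj = len(front[0])
    thresholds: List[Optional[float]] = []
    for i in range(n_obj):
        if i == maximize_obj:
            thresholds.append(None)  # objective l is maximized, not constrained
            continue
        sorted_vals = sorted(point[i] for point in front)
        k = sorted_vals.index(front[init_idx][i])  # 0-based rank of the initial point
        # Assumption: with no lower neighbor, fall back to the point's own value.
        thresholds.append(sorted_vals[k - 1] if k > 0 else sorted_vals[0])
    return thresholds


# Hypothetical three-objective solution set; extend from point 1 along objective 0.
front = [(1.0, 9.0, 4.0), (3.0, 7.0, 5.0), (5.0, 4.0, 6.0), (8.0, 1.0, 2.0)]
d = constraint_thresholds(front, init_idx=1, maximize_obj=0)
```

Any larger value of $d_i$ also satisfies the stated condition; the sketch simply returns the loosest thresholds that the condition allows.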

Figures (8)

  • Figure 1: Visualization of metrics. (a) Hypervolume, reference point, and example of crowd distance calculation. As an example, the crowd distance of $\pi_b$ is calculated based on the expected return of its neighbors $\pi_a$ and $\pi_c$, as well as the extreme solutions on the two objectives. (b) Given a preference vector, the corresponding expected return is calculated by selecting the set max policy from Pareto front solutions.
  • Figure 2: Visualization of criteria for specifying constraint values. $\pi_r$ denotes the initial point. The expected return $G^{\pi_a}$ ($G^{\pi_b}$) of solution $P_a$ ($P_b$) in objective 1 (2) is the $(k-1)^{th}$ value in list $\tilde{G}_1$ ($\tilde{G}_2$). Therefore, specifying constraint values $d_1\geq G^{\pi_a}$ and $d_2\geq G^{\pi_b}$ is sufficient for the feasible solution of the corresponding Eq. \ref{equ:CMDP} to be a Pareto-optimal solution.
  • Figure 3: Procedure of two-stage C-MORL. Pareto initialization: training several initial policies to derive the initial solution set $\mathcal{X}_{init}$. Pareto extension: iteratively implementing policy selection and Pareto extension with constrained policy optimization toward desired Pareto extension directions in the objective space. Policy assignment: given preference $\boldsymbol{\omega}$, the surrogate execution policy is selected from the Pareto set based on Eq. \ref{eq:SMP}.
  • Figure 4: Pareto front comparison on two-objective MO-MuJoCo benchmarks.
  • Figure 5: Pareto front comparison on MO-Ant-3d benchmark.
  • ...and 3 more figures
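
Two quantities from the captions of Figures 1 and 3 can be sketched in a few lines: the crowd distance of each solution, computed from its sorted neighbors and the extreme values on every objective, and the choice of an execution policy for a given preference vector. The sketch assumes the standard crowding-distance normalization known from NSGA-II and a linear utility $\boldsymbol{\omega}^\top G^{\pi}$ for policy assignment. The function names and the sample returns are hypothetical, and the paper's exact definitions may differ.

```python
from typing import List, Sequence


def crowd_distances(returns: Sequence[Sequence[float]]) -> List[float]:
    """Crowding distance of each solution, summed over objectives.

    On every objective, a solution's contribution is the gap between its two
    sorted neighbors divided by the range between the extreme solutions.
    Extreme solutions receive an infinite distance.
    """
    n, m = len(returns), len(returns[0])
    dist = [0.0] * n
    for j in range(m):
        order = sorted(range(n), key=lambda p: returns[p][j])
        lo, hi = returns[order[0]][j], returns[order[-1]][j]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue  # all solutions coincide on this objective
        for r in range(1, n - 1):
            prev_v = returns[order[r - 1]][j]
            next_v = returns[order[r + 1]][j]
            dist[order[r]] += (next_v - prev_v) / (hi - lo)
    return dist


def assign_policy(returns: Sequence[Sequence[float]], preference: Sequence[float]) -> int:
    """Index of the policy with the highest preference-weighted return."""
    utilities = [sum(w * g for w, g in zip(preference, ret)) for ret in returns]
    return max(range(len(returns)), key=utilities.__getitem__)


# Hypothetical expected returns of four policies on two objectives.
returns = [(1.0, 9.0), (4.0, 7.0), (6.0, 3.0), (10.0, 0.0)]
distances = crowd_distances(returns)
chosen = assign_policy(returns, preference=(0.5, 0.5))
```

A larger crowd distance marks a sparser region of the front, so a solution with a high value is a natural candidate to start an extension step from. The assignment step needs no retraining: it only evaluates the weighted returns of the policies already in the set.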

Theorems & Definitions (13)

  • Definition 3.1
  • Definition 3.2
  • Proposition 4.3
  • Proposition 4.4
  • Theorem 4.5
  • Proposition 5.1
  • proof
  • proof
  • proof
  • Lemma D.1
  • ...and 3 more