The Max-Min Formulation of Multi-Objective Reinforcement Learning: From Theory to a Model-Free Algorithm

Giseung Park, Woohyeon Byeon, Seongmin Kim, Elad Havakuk, Amir Leshem, Youngchul Sung

TL;DR

This paper addresses fairness in multi-objective reinforcement learning by adopting a max-min criterion over the objective returns $J_k(\pi)$. It develops a theory that reformulates the max-min problem through linear programming and convex optimization, using state–action visitation frequencies and weights $w$ on the simplex $\Delta^K$. An entropy-regularized max-min formulation (P0') is introduced to resolve indeterminacy, and it links the primal policy to a soft-optimal policy through a soft Bellman operator. The resulting gradient-based, model-free algorithm alternates between soft Q-learning for a fixed weight and Gaussian-smoothing gradient estimation to update the weights. The objective is shown to be convex in $w$, with P1 and P2 sharing the same optimum. On Four-Room, traffic light control, and species conservation tasks, the method improves max-min performance over utilitarian DQN and MDQN baselines. The approach is relevant to fair optimization across multiple objectives in control problems and MARL, where competing goals must be balanced explicitly with scalable, model-free learning.
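The alternation described above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical one-state problem with two actions and $K=2$ objectives, replaces soft Q-learning with the closed-form soft value $\alpha \log \sum_a \exp(\langle w, r(a)\rangle/\alpha)$, and uses a plain forward-difference Gaussian-smoothing estimator followed by Euclidean projection onto the simplex. The reward vectors, temperature, step size, and all function names are assumptions chosen for illustration; the estimator and update used in the paper may differ.

```python
import math
import random

# Hypothetical one-state problem: each action yields a K=2 reward vector.
REWARDS = [(3.0, 1.0), (1.0, 2.0)]
ALPHA = 0.5  # entropy temperature (assumed value)


def action_scores(w):
    """Scalarized reward <w, r(a)> / alpha for every action."""
    return [sum(wk * rk for wk, rk in zip(w, r)) / ALPHA for r in REWARDS]


def soft_value(w):
    """One-state soft-optimal value: alpha * logsumexp(<w, r(a)> / alpha)."""
    scores = action_scores(w)
    m = max(scores)
    return ALPHA * (m + math.log(sum(math.exp(s - m) for s in scores)))


def soft_policy(w):
    """Softmax policy that attains the soft-optimal value for weight w."""
    scores = action_scores(w)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_i
        t = (cumulative - 1.0) / i
        if u_i - t > 0:
            theta = t  # keep the threshold of the last index that passes
    return [max(x - theta, 0.0) for x in v]


def smoothed_gradient(f, w, mu, n_samples, rng):
    """Gaussian-smoothing estimate: mean of (f(w + mu*u) - f(w)) / mu * u."""
    base = f(w)
    grad = [0.0] * len(w)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in w]
        diff = (f([wk + mu * uk for wk, uk in zip(w, u)]) - base) / mu
        for k, uk in enumerate(u):
            grad[k] += diff * uk / n_samples
    return grad


def minimize_over_simplex(f, k, steps=600, lr=0.02, mu=0.05, n_samples=100, seed=0):
    """Projected gradient descent on the weights, using only evaluations of f."""
    rng = random.Random(seed)
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(steps):
        g = smoothed_gradient(f, w, mu, n_samples, rng)
        w = project_to_simplex([wk - lr * gk for wk, gk in zip(w, g)])
    return w


w_star = minimize_over_simplex(soft_value, k=2)
pi = soft_policy(w_star)
returns = [sum(p * r[k] for p, r in zip(pi, REWARDS)) for k in range(2)]
print("weights:", w_star, "policy:", pi, "returns:", returns)
```

In this toy setup the partial derivative of the soft value with respect to $w_k$ equals the return $J_k$ of the soft policy, so each descent step shifts weight away from objectives that are already well served and toward the worst-off one.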

Abstract

In this paper, we consider multi-objective reinforcement learning, which arises in many real-world problems with multiple optimization goals. We approach the problem with a max-min framework focusing on fairness among the multiple goals and develop a relevant theory and a practical model-free algorithm under the max-min framework. The developed theory provides a theoretical advance in multi-objective reinforcement learning, and the proposed algorithm demonstrates a notable performance improvement over existing baseline methods.

Paper Structure

This paper contains 35 sections, 10 theorems, 82 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Theorem 3.1

For each $s$, $v^{*}_{w}(s)$ is a convex function in $w \in \mathbb{R}^K$. Consequently, the objective function $\mathcal{L}(w) = \sum_s \mu_0(s) v^{*}_{w}(s)$ is also convex in $w \in \mathbb{R}^K$.
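A small numerical check can make the convexity statement concrete. The sketch below assumes the standard entropy-regularized soft Bellman operator $v(s) = \alpha \log \sum_a \exp\big((\sum_k w_k r_k(s,a) + \gamma \sum_{s'} P(s'|s,a) v(s'))/\alpha\big)$, which may differ in detail from the operator defined in the paper, and a randomly generated MDP with 3 states, 2 actions, and $K=2$ objectives. The MDP, the constants, and the function names are hypothetical. It tests midpoint convexity, $v^*_{(w_1+w_2)/2}(s) \le \tfrac{1}{2}\big(v^*_{w_1}(s) + v^*_{w_2}(s)\big)$, for random pairs of weights in $\mathbb{R}^K$.

```python
import math
import random


def soft_value_iteration(w, rewards, transitions, gamma=0.9, alpha=1.0, iters=400):
    """Iterate the assumed soft Bellman operator to (numerical) convergence."""
    n_states = len(rewards)
    v = [0.0] * n_states
    for _ in range(iters):
        new_v = []
        for s in range(n_states):
            scores = []
            for a in range(len(rewards[s])):
                scalar_r = sum(wk * rk for wk, rk in zip(w, rewards[s][a]))
                next_v = sum(p * v[s2] for s2, p in enumerate(transitions[s][a]))
                scores.append((scalar_r + gamma * next_v) / alpha)
            m = max(scores)  # subtract the max for a stable log-sum-exp
            new_v.append(alpha * (m + math.log(sum(math.exp(x - m) for x in scores))))
        v = new_v
    return v


# Hypothetical random MDP: rewards[s][a] is a K-vector, transitions[s][a] a distribution.
rng = random.Random(0)
N_STATES, N_ACTIONS, K = 3, 2, 2
rewards = [[[rng.random() for _ in range(K)] for _ in range(N_ACTIONS)]
           for _ in range(N_STATES)]
transitions = []
for s in range(N_STATES):
    rows = []
    for a in range(N_ACTIONS):
        raw = [rng.random() + 0.1 for _ in range(N_STATES)]
        total = sum(raw)
        rows.append([x / total for x in raw])
    transitions.append(rows)

# Midpoint convexity check over random weight pairs in R^K.
max_violation = 0.0
for _ in range(20):
    w1 = [rng.gauss(0.0, 1.0) for _ in range(K)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(K)]
    w_mid = [(a + b) / 2.0 for a, b in zip(w1, w2)]
    v1 = soft_value_iteration(w1, rewards, transitions)
    v2 = soft_value_iteration(w2, rewards, transitions)
    v_mid = soft_value_iteration(w_mid, rewards, transitions)
    for s in range(N_STATES):
        max_violation = max(max_violation, v_mid[s] - 0.5 * (v1[s] + v2[s]))

print("largest midpoint-convexity violation:", max_violation)
```

A violation at the level of floating-point error is what the theorem predicts: for a fixed policy the entropy-regularized value is affine in $w$, and the soft-optimal value is the maximum of these affine functions over policies.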

Figures (6)

  • Figure 1: Achievable return region and Pareto boundary ($K=2$): weighted sum versus max-min approaches. Due to the equalizer rule [zehavi2013weighted], the max-min solution lies on the line $J_1=J_2$, whereas the maximum of the sum $J_1+J_2$ occurs where a line of slope $-1$ is tangent to the boundary. By controlling the ratio $\alpha_1/\alpha_2$, the max-min approach can recover all points on the Pareto boundary.
  • Figure 2: (Left) One-state example [roijers13survey] and (Right) cumulative return vectors $J(\pi_{sc1}^*)$, $J(\pi_{sc2}^*)$, and that of $\pi^*_{op}$.
  • Figure 3: Our formulation procedure of the max-min problem.
  • Figure 4: (Top) Four-Room environment [felten_toolkit_2023] and (Bottom) achievable return region in the Four-Room environment (light blue), the unique Pareto optimal point (red dot), and the point achieved by our algorithm: $(J_1, J_2) = (0.96, 2.88)$ (green dot).
  • Figure 5: (a) Traffic light control task under consideration, (b) Minimum value of the expected discounted return vector across four dimensions, (c) Expected discounted return for each direction, and (d) Average value of the learned weights of the proposed algorithm. In (c), each black dashed line for each algorithm represents the minimum value of the return across four dimensions.
  • ...and 1 more figure

Theorems & Definitions (25)

  • Theorem 3.1
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • proof
  • Theorem 4.1
  • proof
  • Theorem 4.2
  • proof
  • Theorem 4.3
  • ...and 15 more