The Max-Min Formulation of Multi-Objective Reinforcement Learning: From Theory to a Model-Free Algorithm

Giseung Park; Woohyeon Byeon; Seongmin Kim; Elad Havakuk; Amir Leshem; Youngchul Sung

TL;DR

This paper addresses fairness in multi-objective reinforcement learning by adopting a max-min criterion over the objective returns $J_k(\pi)$. It develops a theory that reformulates the problem through linear programming and convex optimization, using state–action visitation frequencies and weights $w$ in the simplex $\Delta^K$. To resolve the indeterminacy of the max-min policy, it introduces an entropy-regularized max-min formulation (P0') and links the primal policy to a soft-optimal policy through a soft Bellman operator. The resulting gradient-based, model-free algorithm alternates between soft Q-learning for a given weight and Gaussian-smoothing gradient estimation to update the weights. The objective is shown to be convex in $w$, with P1 and P2 sharing the same optimum. The method improves max-min performance on the Four-Room, traffic light control, and species conservation tasks, outperforming utilitarian DQN and MDQN baselines. It is relevant to fair optimization across multiple objectives in control problems and MARL, where it enables explicit balancing of competing goals with scalable, model-free learning.
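
The weight-update half of the alternating scheme summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the soft Q-learning step is replaced by a hypothetical closed-form objective `toy_objective` (the best weighted return over three made-up return vectors in `J`), the gradient is estimated with a standard forward-difference Gaussian-smoothing estimator, and the weights are kept on the simplex by Euclidean projection. The names `gs_gradient`, `project_simplex`, the step size, and the smoothing parameter `mu` are all illustrative choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def gs_gradient(f, w, mu=0.05, n_samples=500, rng=None):
    """Forward-difference Gaussian-smoothing estimate of the gradient of f at w."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((n_samples, w.size))   # random directions u_i ~ N(0, I)
    f0 = f(w)
    diffs = np.array([f(w + mu * ui) - f0 for ui in u]) / mu
    return (diffs[:, None] * u).mean(axis=0)

# Hypothetical return vectors of three candidate policies (K = 2 objectives).
J = np.array([[3.0, 0.0], [0.0, 2.0], [1.5, 1.5]])

def toy_objective(w):
    """Stand-in for the inner policy-optimization step: best weighted return."""
    return float(np.max(J @ w))

rng = np.random.default_rng(0)
w = np.array([0.9, 0.1])
initial_value = toy_objective(w)
for _ in range(60):
    g = gs_gradient(toy_objective, w, rng=rng)
    w = project_simplex(w - 0.05 * g)   # projected gradient descent on the weights
final_value = toy_objective(w)
print(w, initial_value, final_value)
```

In this toy setting the minimum of the objective over the simplex is 1.5, which equals the best worst-case return among the three hypothetical policies, so the descent on $w$ should drive the value from 2.7 toward that level.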

Abstract

In this paper, we consider multi-objective reinforcement learning, which arises in many real-world problems with multiple optimization goals. We approach the problem with a max-min framework focusing on fairness among the multiple goals and develop a relevant theory and a practical model-free algorithm under the max-min framework. The developed theory provides a theoretical advance in multi-objective reinforcement learning, and the proposed algorithm demonstrates a notable performance improvement over existing baseline methods.

Paper Structure (35 sections, 10 theorems, 82 equations, 6 figures, 5 tables, 1 algorithm)


Introduction and Motivation
Value Iteration as Linear Programming
Max-Min MORL with LP Formulation
Max-Min MORL Formulation
Equivalent Convex Optimization
Regularization for Max-Min Policy
An Example of Indeterminacy
Entropy-Regularized Max-Min Formulation
The Proposed Model-Free Algorithm
Gradient Estimation Based on Gaussian Smoothing
Experiments
Max-Min Performance
Ablation Study
Related Works
Conclusion
...and 20 more sections

Key Result

Theorem 3.1

For each $s$, $v^{*}_{w}(s)$ is a convex function in $w \in \mathbb{R}^K$. Consequently, the objective function $\mathcal{L}(w) = \sum_s \mu_0(s) v^{*}_{w}(s)$ is also convex in $w \in \mathbb{R}^K$.
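
The convexity claim can be checked numerically on a small random MDP. The sketch below assumes, since this page does not restate the definition, that $v^{*}_{w}$ is the optimal value function for the scalarized reward $\sum_k w_k r_k(s,a)$; under that reading $v^{*}_{w}(s)$ is a pointwise maximum of functions linear in $w$, which is why convexity is expected. The MDP sizes, the discount factor, and the helper name `optimal_value` are illustrative assumptions.

```python
import numpy as np

def optimal_value(P, R, w, gamma=0.9, iters=500):
    """Value iteration for the scalarized reward r_w(s, a) = sum_k w[k] * R[s, a, k]."""
    r = R @ w                         # shape (S, A)
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        q = r + gamma * (P @ v)       # P has shape (S, A, S), so q has shape (S, A)
        v = q.max(axis=1)
    return v

rng = np.random.default_rng(0)
S, A, K = 4, 3, 2                     # small random MDP, sizes chosen arbitrarily
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)     # normalize rows into transition probabilities
R = rng.random((S, A, K))             # one reward component per objective
mu0 = np.full(S, 1.0 / S)             # uniform initial state distribution

def L(w):
    """Objective from the theorem: initial-state average of the optimal value."""
    return float(mu0 @ optimal_value(P, R, w))

# Convexity check: L(t*w1 + (1-t)*w2) - (t*L(w1) + (1-t)*L(w2)) should never be positive.
max_gap = -np.inf
for _ in range(200):
    w1, w2 = rng.normal(size=K), rng.normal(size=K)
    t = rng.random()
    gap = L(t * w1 + (1 - t) * w2) - (t * L(w1) + (1 - t) * L(w2))
    max_gap = max(max_gap, gap)
print("largest convexity violation:", max_gap)
```

A non-positive `max_gap` (up to floating-point error) is consistent with the theorem for weights drawn from all of $\mathbb{R}^K$, not only from the simplex.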

Figures (6)

Figure 1: Achievable return region and Pareto boundary ($K=2$): weighted-sum versus max-min approaches. By the equalizer rule [zehavi2013weighted], the max-min solution lies on the line $J_1=J_2$, whereas the maximum of the sum $J_1+J_2$ lies on the tangent line with slope $-1$. By controlling the ratio $\alpha_1/\alpha_2$, the max-min approach can recover all points on the Pareto boundary.
Figure 2: (Left) One-state example [roijers13survey] and (Right) cumulative return vectors of $J(\pi_{sc1}^*), J(\pi_{sc2}^*)$, and $\pi^*_{op}$.
Figure 3: Our formulation procedure of the max-min problem.
Figure 4: (Top) Four-Room environment [felten_toolkit_2023] and (Bottom) achievable return region in the Four-Room environment (light blue), the unique Pareto optimal point (red dot), and the point achieved by our algorithm: $(J_1, J_2) = (0.96, 2.88)$ (green dot).
Figure 5: (a) Traffic light control task under consideration, (b) Minimum value of the expected discounted return vector across four dimensions, (c) Expected discounted return for each direction, and (d) Average value of the learned weights of the proposed algorithm. In (c), each black dashed line for each algorithm represents the minimum value of the return across four dimensions.
...and 1 more figures

Theorems & Definitions (25)

Theorem 3.1
Theorem 3.2
proof
Theorem 3.3
proof
Theorem 4.1
proof
Theorem 4.2
proof
Theorem 4.3
...and 15 more
