Limitations of Scalarisation in MORL: A Comparative Study in Discrete Environments
Muhammad Sa'ood Shah, Asad Jeewa
TL;DR
This paper investigates the limitations of scalarisation-based outer-loop multi-objective reinforcement learning (MORL) methods in discrete environments by comparing MO Q-Learning with linear and Chebyshev scalarisation against Pareto Q-Learning, an inner-loop multi-policy approach. It demonstrates that scalarisation performance is highly sensitive to the environment and the shape of the Pareto front: scalarised learners often fail to retain solutions found during learning, and they need many weight configurations, which wastes computation. In contrast, Pareto Q-Learning shows faster convergence and denser Pareto-front coverage in at least one environment, suggesting that inner-loop methods offer more robust decision-making under uncertainty. The findings advocate shifting from outer-loop scalarisation to inner-loop multi-policy strategies to achieve sustainable, generalisable MORL in dynamic settings.
Abstract
Scalarisation functions are widely employed in Multi-Objective Reinforcement Learning (MORL) algorithms to enable intelligent decision-making. However, these functions often struggle to approximate the Pareto front accurately, which makes them ill-suited to complex, uncertain environments. This study examines selected MORL algorithms across MORL environments with discrete action and observation spaces. We aim to further investigate the limitations of scalarisation approaches for decision-making in multi-objective settings. Specifically, we use an outer-loop multi-policy methodology to assess the performance of a seminal single-policy MORL algorithm, MO Q-Learning, implemented with linear and Chebyshev scalarisation functions. In addition, we explore a pioneering inner-loop multi-policy algorithm, Pareto Q-Learning, which offers a more robust alternative. Our findings reveal that the performance of the scalarisation functions is highly dependent on the environment and the shape of the Pareto front. These functions often fail to retain the solutions uncovered during learning and favour solutions in certain regions of the solution space. Moreover, finding weight configurations that sample the entire Pareto front is complex, which limits the applicability of scalarisation in uncertain settings. In contrast, inner-loop multi-policy algorithms may provide a more sustainable and generalisable approach and potentially facilitate intelligent decision-making in dynamic and uncertain environments.
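
The dependence on the shape of the Pareto front can be illustrated with a minimal sketch of an outer-loop weight sweep. Everything below is an assumption made for illustration and is not taken from the study: the four return vectors, the utopian point, the weight grid, and the function names are hypothetical, and the weighted Chebyshev distance shown is one common formulation rather than necessarily the exact one used in the experiments.

```python
import numpy as np

# Hypothetical two-objective returns (maximisation). Points 0-2 are
# non-dominated; point 1 lies in a concave region of the front.
points = np.array([[1.0, 0.0], [0.4, 0.4], [0.0, 1.0], [0.3, 0.3]])
utopia = np.array([1.0, 1.0])  # assumed utopian point, chosen by hand


def linear_scalarise(values, weights):
    """Weighted sum of the objectives; higher is better."""
    return values @ weights


def chebyshev_scalarise(values, weights, utopia):
    """Weighted Chebyshev distance to the utopian point; lower is better."""
    return np.max(weights * np.abs(values - utopia), axis=-1)


def pareto_front(values):
    """Indices of the points that no other point dominates."""
    front = []
    for i, p in enumerate(values):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(values)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Outer loop: sweep the weight on the first objective and record which
# point each scalarisation function would select.
linear_selected = set()
chebyshev_selected = set()
for w in np.linspace(0.0, 1.0, 11):
    weights = np.array([w, 1.0 - w])
    linear_selected.add(int(np.argmax(linear_scalarise(points, weights))))
    chebyshev_selected.add(
        int(np.argmin(chebyshev_scalarise(points, weights, utopia)))
    )

print("Pareto front:", pareto_front(points))
print("Linear sweep selects:", sorted(linear_selected))
print("Chebyshev sweep selects:", sorted(chebyshev_selected))
```

With these illustrative values, the linear sweep selects only the two extreme points for every weight on the grid, whereas the Chebyshev sweep also reaches the non-dominated point in the concave region. The sketch covers only the selection step over fixed return vectors; it does not model learning, so it says nothing about whether a scalarised learner retains the solutions it finds.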
