MOMA-AC: A preference-driven actor-critic framework for continuous multi-objective multi-agent reinforcement learning

Adam Callaghan, Karl Mason, Patrick Mannion

TL;DR

This paper introduces Multi-Objective Multi-Agent Actor-Critic (MOMA-AC), the first dedicated inner-loop actor-critic framework for multi-objective multi-agent reinforcement learning in continuous state and action spaces, as a foundational step towards robust, scalable multi-objective policy learning in continuous multi-agent domains.

Abstract

This paper addresses a critical gap in Multi-Objective Multi-Agent Reinforcement Learning (MOMARL) by introducing the first dedicated inner-loop actor-critic framework for continuous state and action spaces: Multi-Objective Multi-Agent Actor-Critic (MOMA-AC). Building on single-objective, single-agent algorithms, we instantiate this framework with Twin Delayed Deep Deterministic Policy Gradient (TD3) and Deep Deterministic Policy Gradient (DDPG), yielding MOMA-TD3 and MOMA-DDPG. The framework combines a multi-headed actor network, a centralised critic, and an objective preference-conditioning architecture, enabling a single neural network to encode the Pareto front of optimal trade-off policies for all agents across conflicting objectives in a continuous MOMARL setting. We also outline a natural test suite for continuous MOMARL by combining a pre-existing multi-agent single-objective physics simulator with its multi-objective single-agent counterpart. Evaluating cooperative locomotion tasks in this suite, we show that our framework achieves statistically significant improvements in expected utility and hypervolume relative to outer-loop and independent training baselines, while demonstrating stable scalability as the number of agents increases. These results establish our framework as a foundational step towards robust, scalable multi-objective policy learning in continuous multi-agent domains.

Paper Structure

This paper contains 35 sections, 31 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Architecture and training flow of the MOMA-AC framework for a three-agent decomposition task. A sampled preference weight $\boldsymbol{\omega}$ conditions both the multi-headed actor and the centralised critic. The actor maps each agent's local observation and $\boldsymbol{\omega}$ to a decentralised action that is executed in the environment, while the critic learns vector-valued Q-functions over joint states and actions. The actor is updated to maximise the scalarised critic output, enabling preference-conditioned policy learning under centralised training with decentralised execution (CTDE).
  • Figure 2: Walker (left), HalfCheetah (centre) and Hopper (right) morphologies in their decomposed form, adapted from the MaMuJoCo environment. Images sourced from the Farama Foundation documentation under the MIT License.
  • Figure 3: HalfCheetah (1, 2, 6 agents): expected utility (EUM) and hypervolume (HV) learning curves comparing MOMA-TD3, MOMA-DDPG, the independent training baseline (IND), and the Outer-Loop baseline. MOMA-TD3 leads throughout, with gaps widening at higher agent counts.
  • Figure 4: Hopper (1, 3 agents): EUM and HV learning curves. Both MOMA-AC variants beat baselines; MOMA-TD3 leads while MOMA-DDPG remains competitive.
  • Figure 5: Walker (1, 2 agents): EUM and HV learning curves. Both MOMA-AC variants exceed baselines; MOMA-TD3 and MOMA-DDPG are comparable at these agent counts.
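
The preference-conditioning idea described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the Dirichlet sampling of $\boldsymbol{\omega}$, the linear (weighted-sum) scalarisation, and the function names `sample_preference`, `actor_input`, `scalarise`, and `actor_loss` are all assumptions made for illustration, since the captions do not specify how preferences are sampled or which scalarisation function is used.

```python
import numpy as np


def sample_preference(num_objectives, rng):
    """Draw a preference weight vector from the probability simplex.

    Assumption: a uniform Dirichlet distribution; the source does not
    state the sampling scheme.
    """
    return rng.dirichlet(np.ones(num_objectives))


def actor_input(local_obs, omega):
    """Condition an agent's actor on the preference by concatenating
    the preference weights onto its local observation (hypothetical helper)."""
    return np.concatenate([local_obs, omega])


def scalarise(q_values, omega):
    """Linear scalarisation (an assumption): weighted sum of the
    per-objective Q-values.

    q_values has shape (batch, num_objectives); omega has shape
    (num_objectives,). Returns an array of shape (batch,).
    """
    return q_values @ omega


def actor_loss(q_values, omega):
    """Deterministic-policy-gradient style objective: minimising the
    negative mean scalarised Q-value maximises the scalarised critic output."""
    return -float(np.mean(scalarise(q_values, omega)))


rng = np.random.default_rng(0)
omega = sample_preference(2, rng)           # non-negative weights summing to 1
obs = np.array([0.1, -0.2, 0.3])            # toy local observation
x = actor_input(obs, omega)                 # shape (5,): 3 obs dims + 2 weights

# Toy vector-valued critic output for a batch of two joint state-action pairs.
q = np.array([[1.0, 3.0],
              [2.0, 0.0]])
equal_weights = np.array([0.5, 0.5])
print(scalarise(q, equal_weights))          # [2. 1.]
print(actor_loss(q, equal_weights))         # -1.5
```

In a full training loop the critic would be a neural network evaluated on joint states and actions, and the loss would be differentiated with respect to the actor parameters; the sketch only shows how a single preference vector ties the actor input and the scalarised critic signal together.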