Preference-Conditioned Gradient Variations for Multi-Objective Quality-Diversity

Hannah Janmohamed, Maxence Faldor, Thomas Pierrot, Antoine Cully

TL;DR

The paper tackles the challenge of generating diverse, high-performing solutions for multi-objective tasks by integrating preference-conditioned gradient mutations with crowding in a Map-Elites-style archive. The proposed MOME-P2C uses a single, preference-conditioned actor-critic to produce policy-gradient updates aimed at targeted objective trade-offs, while crowding preserves a uniform spread of non-dominated solutions across feature cells. Empirical results across six robotics locomotion tasks, including tri-objective variants, show that MOME-P2C outperforms or matches state-of-the-art MOQD baselines and achieves smoother trade-offs, with ablations highlighting the importance of crowding, actor-injection, and gradient mutations. The approach offers improved data efficiency and scalability to higher-objective settings, with practical implications for robust, adaptable control policies in robotics and related domains.

Abstract

In a variety of domains, from robotics to finance, Quality-Diversity algorithms have been used to generate collections of both diverse and high-performing solutions. Multi-Objective Quality-Diversity algorithms have emerged as a promising approach for applying these methods to complex, multi-objective problems. However, existing methods are limited by their search capabilities. For example, Multi-Objective Map-Elites depends on random genetic variations which struggle in high-dimensional search spaces. Despite efforts to enhance search efficiency with gradient-based mutation operators, existing approaches consider updating solutions to improve on each objective separately rather than achieving desired trade-offs. In this work, we address this limitation by introducing Multi-Objective Map-Elites with Preference-Conditioned Policy-Gradient and Crowding Mechanisms: a new Multi-Objective Quality-Diversity algorithm that uses preference-conditioned policy-gradient mutations to efficiently discover promising regions of the objective space and crowding mechanisms to promote a uniform distribution of solutions on the non-dominated front. We evaluate our approach on six robotics locomotion tasks and show that our method outperforms or matches all state-of-the-art Multi-Objective Quality-Diversity methods in all six, including two newly proposed tri-objective tasks. Importantly, our method also achieves a smoother set of trade-offs, as measured by newly-proposed sparsity-based metrics.
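The two ingredients named in the abstract, preference-conditioned updates and crowding, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the preference is assumed to be a weight vector drawn uniformly from the simplex, the trade-off is assumed to be a weighted sum of per-objective rewards, and the crowding measure is the standard NSGA-II crowding distance used as a stand-in for whatever crowding rule the paper defines.

```python
import random


def sample_preference(num_objectives, rng):
    """Draw a weight vector uniformly from the probability simplex (assumed sampler)."""
    # Normalised exponential draws follow Dirichlet(1, ..., 1), i.e. uniform on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(rewards, preference):
    """Weighted sum of per-objective rewards under a preference vector (assumed form)."""
    return sum(w * r for w, r in zip(preference, rewards))


def crowding_distances(front):
    """NSGA-II-style crowding distance for a list of objective vectors.

    Larger values mean a solution sits in a less crowded part of the front;
    boundary solutions on each objective get infinite distance.
    """
    n = len(front)
    if n == 0:
        return []
    dist = [0.0] * n
    for k in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue  # all solutions tie on this objective
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][k] - front[order[pos - 1]][k]
            dist[order[pos]] += gap / (hi - lo)
    return dist


if __name__ == "__main__":
    rng = random.Random(0)
    w = sample_preference(2, rng)
    print("preference:", w, "scalarised reward:", scalarize([2.0, 4.0], w))
    front = [(0.0, 1.0), (0.1, 0.9), (0.5, 0.5), (1.0, 0.0)]
    print("crowding distances:", crowding_distances(front))
```

Under these assumptions, a cell that exceeds its capacity would drop the solution with the smallest crowding distance, which pushes the stored front towards an even spread rather than a cluster of near-identical trade-offs.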

Paper Structure

This paper contains 34 sections, 13 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Two sets of solutions that form different non-dominated fronts as approximations of the true Pareto front. The outer set (marked by circles) achieves a larger hypervolume as it extends further in objective space. Likewise, its sparsity metric is higher, reflecting a more even spread of solutions. When the two sets are combined, only the solutions in the outer set are non-dominated.
  • Figure 2: Overview of the MOME-P2C algorithm. A non-dominated front is stored in each cell of a MAP-Elites grid. At each iteration, a batch of solutions is selected, undergoes variation, and is added back to the grid based on performance and crowding distance. As solutions are evaluated, environment transitions are gathered in a replay buffer and used to train preference-conditioned networks. These networks are used with a preference sampler to perform preference-conditioned policy-gradient (PG) updates.
  • Figure 3: MOQD-score, global-hypervolume and maximum sum of scores (see the metrics section) for MOME-P2C compared to all baselines across all tasks. Each experiment is replicated 20 times with random seeds. The solid line is the median and the shaded area spans the first to third quartiles.
  • Figure 4: Boxplots of the sparsity metrics calculated on the final archives of MOME-P2C and MOME-PGX over 20 replications. The labels A2, A3, HC2, H2, H3 and W2 correspond to the ant-2, ant-3, halfcheetah-2, hopper-2, hopper-3 and walker-2 environments respectively.
  • Figure 5: MOQD-score (see the metrics section) for MOME-P2C compared to all ablations across all tasks. Each experiment is replicated 20 times with random seeds. The solid line is the median and the shaded area spans the first to third quartiles.
  • ...and 8 more figures
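
The comparison described in the Figure 1 caption can be reproduced with a small sketch. It assumes two objectives that are both maximised and a hypothetical reference point at the origin; the helper names and the example points are illustrative and do not come from the paper.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(points):
    """Keep the points that no other point dominates, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, reference):
    """Area dominated by a two-objective front and bounded below by the reference point."""
    # Sweep from the largest first objective; on a front the second objective then increases.
    front = sorted(non_dominated(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, reference[1]
    for x, y in front:
        if x <= reference[0] or y <= prev_y:
            continue  # contributes no new area
        area += (x - reference[0]) * (y - prev_y)
        prev_y = y
    return area


if __name__ == "__main__":
    inner = [(1, 3), (2, 2), (3, 1)]  # illustrative inner front
    outer = [(2, 4), (3, 3), (4, 2)]  # illustrative outer front
    ref = (0, 0)                      # assumed reference point
    print(hypervolume_2d(inner, ref), hypervolume_2d(outer, ref))
    print(non_dominated(inner + outer))
```

With these example points the outer front has the larger hypervolume, and merging the two sets leaves only the outer points, matching the behaviour the caption describes.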