Post Hoc Extraction of Pareto Fronts for Continuous Control

Raghav Thakar; Gaurav Dixit; Kagan Tumer

TL;DR

Mixed Advantage Pareto Extraction (MAPEX) is an offline MORL method that constructs a frontier of policies by reusing pre-trained specialist policies, critics, and replay buffers. It combines evaluations from the specialist critics into a mixed advantage signal and uses that signal to weight a behaviour cloning loss, training new policies that balance multiple objectives.

Abstract

Agents in the real world must often balance multiple objectives, such as speed, stability, and energy efficiency in continuous control. To account for changing conditions and preferences, an agent should ideally learn a Pareto frontier of policies representing multiple optimal trade-offs. Recent advances in multi-policy multi-objective reinforcement learning (MORL) enable learning a Pareto front directly, but require full multi-objective consideration from the start of training. In practice, multi-objective preferences often arise after a policy has already been trained on a single specialised objective. Existing MORL methods cannot leverage these pre-trained 'specialists' to learn Pareto fronts, and therefore incur the sample costs of retraining. We introduce Mixed Advantage Pareto Extraction (MAPEX), an offline MORL method that constructs a frontier of policies by reusing pre-trained specialist policies, critics, and replay buffers. MAPEX combines evaluations from specialist critics into a mixed advantage signal, and weights a behaviour cloning loss with it to train new policies that balance multiple objectives. MAPEX's post hoc Pareto front extraction preserves the simplicity of single-objective off-policy RL, and avoids retrofitting these algorithms into complex MORL frameworks. We formally describe the MAPEX procedure and evaluate MAPEX on five multi-objective MuJoCo environments. Given the same starting policies, MAPEX produces comparable fronts at $0.001\%$ of the sample cost of established baselines.
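The idea of weighting a behaviour cloning loss with a mixed advantage can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the convex combination via a preference vector, the exponential weighting with temperature `beta`, the clipping constant `max_weight`, and the function names `mixed_advantage` and `weighted_bc_loss` are all assumptions borrowed from generic advantage-weighted regression, and the per-objective advantages are taken as given rather than computed from the specialist critics.

```python
import numpy as np

def mixed_advantage(advantages, preference):
    """Combine per-objective advantages of shape (N, K) into one signal of shape (N,).

    `preference` is a hypothetical non-negative weight vector of shape (K,);
    it is normalised so the result is a convex combination (an assumption).
    """
    advantages = np.asarray(advantages, dtype=float)
    preference = np.asarray(preference, dtype=float)
    preference = preference / preference.sum()
    return advantages @ preference

def weighted_bc_loss(policy_actions, buffer_actions, mixed_adv,
                     beta=1.0, max_weight=20.0):
    """Behaviour cloning loss with each sample weighted by exp(advantage / beta).

    The exponential form, `beta`, and `max_weight` are assumptions, not values
    taken from the paper.
    """
    policy_actions = np.asarray(policy_actions, dtype=float)
    buffer_actions = np.asarray(buffer_actions, dtype=float)
    weights = np.minimum(np.exp(np.asarray(mixed_adv, dtype=float) / beta), max_weight)
    # Squared distance between the policy's action and the buffer action.
    per_sample = np.sum((policy_actions - buffer_actions) ** 2, axis=1)
    return float(np.mean(weights * per_sample))

# Two transitions, two objectives, equal preference for both objectives.
adv = mixed_advantage([[1.0, -1.0], [0.0, 2.0]], [0.5, 0.5])
loss = weighted_bc_loss([[0.0, 0.0], [1.0, 1.0]],
                        [[1.0, 0.0], [1.0, 1.0]], adv)
```

Sweeping the preference vector between the two specialists' objectives would, under these assumptions, yield one training signal per target trade-off.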

Paper Structure (28 sections, 12 equations, 4 figures, 6 tables, 1 algorithm)

Introduction
Background
Actor–Critic Methods and Offline Reinforcement Learning
Actor–critic methods:
Offline Reinforcement Learning:
Multi-Objective Sequential Decision-Making
Pareto Dominance and Optimality
Related Works
Method
Step 1: Gap Identification and Parent Selection
Step 2: Hybrid Buffer Creation and Advantage Mixing
Step 3: Mixed Advantage Weighted Regression
Mitigating Out-of-Distribution Error
Secondary Critics
Notation:
...and 13 more sections

Figures (4)

Figure 1: Sample efficiency comparison on MO-Ant-v5. (Top) Mean hypervolume $\pm$ SEM vs. cumulative environment samples. MAPEX and MAPEX-PostHoc achieve high hypervolume almost instantaneously, while MOPDERL requires significantly more interaction. (Bottom) Evolution of the Pareto front approximation. MAPEX/MAPEX-PostHoc fill the front immediately, whereas MOPDERL gradually expands coverage over 300,000+ environment samples.
Figure 2: Samples required to attain target hypervolume thresholds. Comparison on MO-Hopper-v5 (top) and MO-Walker2d-v5 (bottom). Note the logarithmic scale on the y-axis. MAPEX and MAPEX-PostHoc require up to three orders of magnitude fewer samples ($10^2$ vs $10^5$) than MOPDERL to reach identical performance levels.
Figure 3: Robustness of MAPEX to specialist type and critic training. Mean hypervolume $\pm$ SEM over generations on MO-Walker2d-v5 and MO-Ant-v5. The similar performance of standard MAPEX, MAPEX-PostHoc (offline critics), and MAPEX-TD3 (off-policy specialists) demonstrates the method's flexibility in effectively extracting fronts from decoupled pre-trained sources.
Figure 4: Final Pareto fronts across five MO-MuJoCo benchmarks. Comparison of fronts extracted by MAPEX against fully trained MOPDERL and MORL/D baselines. Despite relying purely on single-objective training data, MAPEX recovers fronts that are dense and competitive with baselines trained from scratch.
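
The figures above report hypervolume as the quality measure for a front. As a rough illustration of that metric, the sketch below filters a set of two-objective returns down to its non-dominated points and computes the area they dominate relative to a reference point. It assumes both objectives are maximised and is restricted to two objectives; the function names `pareto_front` and `hypervolume_2d` and the choice of reference point are hypothetical, not taken from the paper.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated rows of `points` (maximisation in every objective)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some point is >= p everywhere and > p somewhere.
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

def hypervolume_2d(points, reference):
    """Area dominated by the front and bounded below by `reference` (2 objectives)."""
    ref = np.asarray(reference, dtype=float)
    front = pareto_front(points)
    # Only points strictly better than the reference contribute area.
    front = front[np.all(front > ref, axis=1)]
    if len(front) == 0:
        return 0.0
    # Sort by the first objective, descending; the second then ascends.
    front = front[np.argsort(-front[:, 0])]
    hv = 0.0
    prev_y = ref[1]
    for x, y in front:
        # Add the horizontal strip this point contributes above the previous one.
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return float(hv)

returns = [[3.0, 1.0], [2.0, 2.0], [1.0, 3.0], [1.0, 1.0]]
front = pareto_front(returns)
hv = hypervolume_2d(returns, [0.0, 0.0])
```

Here the point `[1.0, 1.0]` is dominated by `[2.0, 2.0]` and is dropped, so only the remaining three points contribute to the hypervolume.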
