Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

Claude Formanek, Callum Rhys Tilbury, Louise Beyers, Jonathan Shock, Arnu Pretorius

TL;DR

Simple, well-implemented baselines achieve state-of-the-art (SOTA) results across a wide range of offline MARL tasks, and a study of published work reveals significant shortcomings in how the performance of novel algorithms is currently measured.

Abstract

Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, trust newly proposed innovations, and allow researchers to easily build upon prior work. In this paper, we firstly identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Secondly, by directly comparing to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted from this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.

Paper Structure (29 sections, 8 figures, 15 tables)

Figures (8)

  • Figure 1: We compare our baseline implementations to the reported performance of various algorithms from the literature across a wide range of datasets. We normalise results from each dataset (i.e. scenario-quality-source combination) by the SOTA performance from the literature for that dataset. Standard deviation bars are given, and a gold star marks each dataset on which our baseline is significantly better than, or statistically equal to, the best method according to a two-sided t-test. We find that on 35 out of the 47 datasets tested (almost 75% of cases), we match or surpass the performance of the current SOTA.
  • Figure 2: Comparing the performance of QMIX+CQL and MADDPG+CQL, two algorithms that could reasonably be called MACQL in the literature (see Table \ref{table:macql}), using the Medium dataset from three different SMACv1 scenarios. We see that the difference in performance of these algorithms is significant, and depends on the scenario considered.
  • Figure 3: A comparison of the performance of behaviour cloning (BC) and QMIX+CQL on the SMACv1 8m scenario with the Medium dataset, across 10 seeds. Although QMIX+CQL outperforms BC during the first half of training, its performance deteriorates in the second half, making BC the preferred algorithm over the maximum training time.
  • Figure 4: Performance profiles \cite{agarwal2022deep} aggregated across all results from Table \ref{tab:new-foundations-results} on SMACv1 and MAMuJoCo. Scores are normalised as per \cite{fu2021d4rl}.
  • Figure 5: Results reported by OMAC on the SMAC environment.
  • ...and 3 more figures
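
The captions for Figures 1 and 4 mention two score normalisations and a performance profile. The following is a minimal sketch of how these quantities are commonly computed, not the authors' implementation. The function names and all numbers are hypothetical and chosen only for illustration. It assumes the usual D4RL convention (0 for a random policy, 100 for an expert policy), a positive SOTA score for the ratio in Figure 1, and the run-score definition of a performance profile (the fraction of runs scoring above a threshold).

```python
from typing import List, Sequence


def normalise_by_sota(score: float, sota_score: float) -> float:
    """Ratio to the best literature score for the same dataset (as in Figure 1).

    Assumes sota_score > 0; a value of 1.0 means parity with the reported SOTA.
    """
    return score / sota_score


def d4rl_normalise(score: float, random_score: float, expert_score: float) -> float:
    """D4RL-style normalisation: 0 maps to a random policy, 100 to an expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)


def performance_profile(
    scores: Sequence[float], thresholds: Sequence[float]
) -> List[float]:
    """For each threshold tau, the fraction of runs with score strictly above tau."""
    n = len(scores)
    return [sum(s > tau for s in scores) / n for tau in thresholds]


# Hypothetical normalised scores pooled over seeds and tasks (illustrative only).
example_scores = [20.0, 40.0, 60.0, 80.0]
example_profile = performance_profile(example_scores, thresholds=[0.0, 50.0, 100.0])
```

A profile curve that lies above another at every threshold indicates that the corresponding algorithm achieves higher normalised scores on a larger fraction of runs, which is why profiles are less sensitive to a few outlier tasks than a mean.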