MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

The Viet Bui, Tien Mai, Hong Thanh Nguyen

TL;DR

MisoDICE tackles offline multi-agent imitation learning from unlabeled, mixed-quality demonstrations. It couples a two-stage labeling process (LLM-based preferences, then refinement by O-MAPL to recover rewards) with a convex multi-agent IL method trained under centralized training with decentralized execution (CTDE). The IL method uses a linear value-decomposition mixer to preserve global-local consistency. The approach supports robust policy learning from heterogeneous data, scales to large joint action spaces, and comes with theoretical guarantees on convexity and consistency. Empirically, MisoDICE outperforms diverse baselines on the SMACv2 and MaMujoco benchmarks, with particular gains when expert data is scarce, and ablation studies support the importance of the mixing architecture and the labeling strategy. The results indicate that LLMs are practical for identifying expert trajectories in MARL and that the framework offers a scalable blueprint for offline learning from mixed-quality demonstrations.

Abstract

We study offline imitation learning (IL) in cooperative multi-agent settings, where the demonstrations are of unlabeled, mixed quality and contain both expert and suboptimal trajectories. Our proposed solution is structured in two stages, trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.
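
As a rough illustration of how the labeling stage can feed the imitation stage, the sketch below ranks trajectories by an estimated return and keeps the top k as the expert subset. It assumes that the recovered reward has already been summed into one score per trajectory; the function name `select_top_k` and the numbers in `est_returns` are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_top_k(returns, k):
    """Return the indices of the k trajectories with the highest estimated return."""
    returns = np.asarray(returns, dtype=float)
    # Sort in descending order; a stable sort keeps ties in their original order.
    order = np.argsort(-returns, kind="stable")
    return order[:k].tolist()

# Hypothetical per-trajectory returns under a recovered reward model.
est_returns = [3.2, 9.1, 0.5, 7.4, 8.8]

# Indices treated as expert-quality; every trajectory stays in the mixed dataset.
expert_idx = select_top_k(est_returns, k=2)
mixed_idx = list(range(len(est_returns)))
```

The value of k is a design choice: a small k gives a cleaner but smaller expert set, while a large k admits more suboptimal trajectories.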

Paper Structure

This paper contains 46 sections, 10 theorems, 37 equations, 12 figures, 14 tables, 3 algorithms.

Key Result

Proposition 5.1

If the mixing network $\mathcal{M}_\phi[\pmb{\nu}(\textbf{s})]$ is a linear function of both $\pmb{\nu}(\textbf{s})$ and $\phi$, then the training objective $\mathcal{L}(\phi, \pmb{\nu})$ is convex in both $\phi$ and $\pmb{\nu}$.
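
To make the linearity condition concrete, here is a minimal sketch of one possible mixer: a weighted sum of the per-agent values plus a bias. This parametrization is an assumption made for illustration (the names `linear_mixer`, `weights`, and `bias` are hypothetical), not necessarily the architecture used in the paper. The check confirms that such a mixer maps a convex combination of local value vectors to the same convex combination of mixed values, which is the property that lets a convex outer function remain convex after composition with the mixer.

```python
import numpy as np

def linear_mixer(nu_local, weights, bias):
    """Mix per-agent values nu_i(s) into one global value: sum_i w_i * nu_i + b."""
    return float(np.dot(weights, nu_local) + bias)

rng = np.random.default_rng(0)
n_agents = 3

# Hypothetical mixer parameters (phi) and two arbitrary local-value vectors.
w = rng.uniform(0.1, 1.0, n_agents)
b = 0.5
nu_a = rng.normal(size=n_agents)
nu_b = rng.normal(size=n_agents)

# The mixer commutes with convex combinations of its input.
alpha = 0.3
lhs = linear_mixer(alpha * nu_a + (1 - alpha) * nu_b, w, b)
rhs = alpha * linear_mixer(nu_a, w, b) + (1 - alpha) * linear_mixer(nu_b, w, b)
gap = abs(lhs - rhs)
```

The same check works with the roles swapped, holding `nu_local` fixed and interpolating between two parameter settings, since the output is also linear in `weights` and `bias`.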

Figures (12)

  • Figure 1: Learning curves of the average return for MisoDICE and baseline methods on SMACv2.
  • Figure 2: Box plots of final returns on SMACv2 by varying the number of top-k expert trajectories.
  • Figure 3: Generalized Expert Dataset Identification via Preference Learning.
  • Figure 4: Multi-Agent Imitation Policy Learning.
  • Figure 5: Learning curves of the average return for MisoDICE and baseline methods on SMACv1 tasks when using an LLM-based preference labeling approach.
  • ...and 7 more figures

Theorems & Definitions (15)

  • Proposition 5.1
  • Proposition 5.2
  • Proposition 5.3
  • Proposition 5.4: Global–Local Consistency
  • Proposition 5.5: Local Policy as a Softmax over Local Functions
  • Proposition
  • proof
  • Proposition
  • proof
  • Proposition
  • ...and 5 more