MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

The Viet Bui; Tien Mai; Hong Thanh Nguyen

TL;DR

MisoDICE addresses offline multi-agent imitation learning from unlabeled, mixed-quality demonstrations. It couples a two-stage labeling process (LLM-based preferences, refined by O-MAPL to recover rewards) with a convex multi-agent IL method built on centralized training with decentralized execution (CTDE), using a linear value-decomposition mixer to preserve global-local consistency. The approach supports robust policy learning from heterogeneous data and scales to large joint action spaces, with theoretical guarantees on convexity and consistency. Empirically, MisoDICE outperforms diverse baselines on the SMACv2 and MaMujoco benchmarks, with particular gains when expert data is scarce, and ablation studies support the importance of the mixing architecture and labeling strategy. The framework shows that LLMs can be used in practice to identify expert trajectories in MARL and provides a scalable blueprint for offline learning from mixed-quality demonstrations.

Abstract

We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations are unlabeled and of mixed quality, containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages, trajectory labeling and multi-agent imitation learning, which are designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.
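
As a rough illustration of what a linear value-decomposition mixer does, the sketch below combines per-agent local values into a single global value. It is not the authors' implementation: the function name `linear_mix`, the fixed weights, and the non-negativity constraint on the weights are all assumptions made for this example (non-negative mixing weights are a common choice in value-decomposition methods, but the text above does not specify how MisoDICE parameterizes its mixer).

```python
from typing import Sequence


def linear_mix(local_values: Sequence[float],
               weights: Sequence[float],
               bias: float = 0.0) -> float:
    """Hypothetical linear mixer: global value = sum_i w_i * v_i + bias.

    Assumption for this sketch: weights are non-negative, so the global
    value is non-decreasing in every agent's local value.
    """
    if len(local_values) != len(weights):
        raise ValueError("need exactly one weight per agent")
    if any(w < 0 for w in weights):
        raise ValueError("mixing weights must be non-negative")
    return sum(w * v for w, v in zip(weights, local_values)) + bias


# Three agents with illustrative local values and mixing weights.
weights = [0.5, 1.0, 0.25]
base = linear_mix([1.0, -0.5, 2.0], weights, bias=0.1)

# Raising only the second agent's local value cannot lower the global value.
improved = linear_mix([1.0, 0.5, 2.0], weights, bias=0.1)
```

Under these assumptions, an improvement in any single agent's local value never decreases the mixed global value, which is one simple sense in which local and global quantities can be kept consistent.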
