MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies

Dayi Dong, Maulik Bhatt, Seoyeon Choi, Negar Mehr

TL;DR

MIMIC-D addresses multi-modal multi-agent coordination without a central planner by learning one diffusion-based policy per agent within a Centralized Training, Decentralized Execution (CTDE) framework. Each agent samples from a diffusion model conditioned only on its local observations, so coordination emerges implicitly, without explicit communication. Across simulated environments and hardware experiments, the approach matches the expert trajectory distribution more closely and produces fewer collisions than behavior cloning (BC), MAGAIL, and Vanilla CTDE Diffusion. Because it captures diverse coordination modes and supports replanning from local information, the method is suited to real-world settings where a centralized planner or inter-agent communication is impractical.

Abstract

As robots become more integrated in society, their ability to coordinate with other robots and humans on multi-modal tasks (those with multiple valid solutions) is crucial. We propose to learn such behaviors from expert demonstrations via imitation learning (IL). However, when expert demonstrations are multi-modal, standard IL approaches can struggle to capture the diverse strategies, hindering effective coordination. Diffusion models are known to be effective at handling complex multi-modal trajectory distributions in single-agent systems. Diffusion models have also excelled in multi-agent scenarios where multi-modality is more common and crucial to learning coordinated behaviors. Typically, diffusion-based approaches require a centralized planner or explicit communication among agents, but this assumption can fail in real-world scenarios where robots must operate independently or with agents like humans that they cannot directly communicate with. Therefore, we propose MIMIC-D, a Centralized Training, Decentralized Execution (CTDE) paradigm for multi-modal multi-agent imitation learning using diffusion policies. Agents are trained jointly with full information, but execute policies using only local information to achieve implicit coordination. We demonstrate in both simulation and hardware experiments that our method recovers multi-modal coordination behavior among agents in a variety of tasks and environments, while improving upon state-of-the-art baselines.

Paper Structure

This paper contains 27 sections, 3 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: MIMIC-D deployed on a bimanual manipulation setup. An xArm7 and a Kinova3 robotic arm collaborate to lift a basket around an obstacle. The task presents a multi-modal coordination challenge as the arms need to coordinate to either pass the obstacle on the right or on the left. Using our method, the arms achieve coordination by independently sampling their policies based on local observations without explicit communication. Our method successfully recovers both solution modes, demonstrating its ability to capture diverse coordination strategies while avoiding the freezing robot problem. The panels depict the starting configuration ($t_1$), mode decision ($t_2$), and the completion of both modes ($t_3$, $t_4$).
  • Figure 2: An overview of our MIMIC-D framework. In the centralized training process (top), we use a dataset of multi-agent expert demonstrations to train the robot policies jointly. During the decentralized execution process (bottom), each agent plans its trajectory independently, using only its local observations to sample the diffusion model.
  • Figure 3: Visualizations of the Two-Agent Swap and Three-Agent Road Crossing environments, along with examples of the multi-modal expert demonstrations used to train the various models. The Swap environment includes 6 distinct solution modes, whereas the Road Crossing environment has no explicit modes and instead exhibits subtler collision-avoidance behaviors.
  • Figure 4: Two-arm lift environment visualizations. We show the two-arm lift task in both its Robosuite simulation version and its hardware demonstration. The simulation task uses two Kinova3 arms, while the hardware task uses one Kinova3 and one xArm7 arm.
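
The decentralized execution step described for Figure 2 can be illustrated with a minimal sketch, in which each agent runs its own reverse-diffusion loop conditioned only on its local observation. Everything specific here is an assumption for illustration: the DDPM-style ancestral sampler, the linear noise schedule, the 1-D waypoints, and the names `make_schedule`, `toy_denoiser`, and `sample_trajectory` are hypothetical and not taken from the paper. In particular, `toy_denoiser` is a hand-written stand-in for a trained noise-prediction network, so it cannot reproduce multi-modal behavior.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = [beta_start + (beta_end - beta_start) * k / (num_steps - 1)
             for k in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars


def toy_denoiser(noisy_traj, local_obs, k, alpha_bars):
    """Hypothetical stand-in for a trained network eps_theta(x_k, o_i, k).

    It assumes the clean trajectory is a straight line from the agent's own
    start to its own goal and returns the noise implied by that guess.
    """
    horizon = len(noisy_traj)
    start, goal = local_obs["start"], local_obs["goal"]
    clean = [start + (goal - start) * t / (horizon - 1) for t in range(horizon)]
    scale = math.sqrt(alpha_bars[k])
    sigma = math.sqrt(1.0 - alpha_bars[k])
    return [(x - scale * c) / sigma for x, c in zip(noisy_traj, clean)]


def sample_trajectory(local_obs, horizon=8, num_steps=50, rng=None):
    """DDPM ancestral sampling conditioned only on one agent's local observation."""
    rng = rng or random.Random(0)
    betas, alpha_bars = make_schedule(num_steps)
    traj = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # start from pure noise
    for k in reversed(range(num_steps)):
        eps = toy_denoiser(traj, local_obs, k, alpha_bars)
        coef = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        mean = [(x - coef * e) / math.sqrt(1.0 - betas[k])
                for x, e in zip(traj, eps)]
        if k > 0:
            std = math.sqrt(betas[k])  # common sigma_k^2 = beta_k variant
            traj = [m + std * rng.gauss(0.0, 1.0) for m in mean]
        else:
            traj = mean  # no noise is added at the final step
    return traj


# Decentralized execution: each agent samples on its own, with no shared state
# and no messages exchanged between agents.
observations = {
    "agent_0": {"start": 0.0, "goal": 1.0},
    "agent_1": {"start": 1.0, "goal": 0.0},
}
plans = {name: sample_trajectory(obs, rng=random.Random(i))
         for i, (name, obs) in enumerate(observations.items())}
```

In a real system the toy denoiser would be replaced by each agent's policy network obtained from centralized training, and `sample_trajectory` could be called again whenever the local observation changes to replan.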