COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

Hongxin Zhang; Zeyuan Wang; Qiushi Lyu; Zheyuan Zhang; Sunli Chen; Tianmin Shu; Behzad Dariush; Kwonjoon Lee; Yilun Du; Chuang Gan

TL;DR

COMBO tackles embodied multi-agent cooperation under partial observability by learning a compositional diffusion-based world model that factors joint actions into per-agent components and composes their effects on future frames. It integrates this world model with Vision-Language planning submodules (Action Proposer, Intent Tracker, Outcome Evaluator) and a tree-search planner to enable online cooperative planning. The approach reconstructs a global world state from egocentric views, then imagines action outcomes to guide long-horizon coordination, achieving strong performance on TDW-based benchmarks and generalizing to different agent counts. The results highlight the value of compositional dynamics and VLM-based planning for scalable, cooperative embodied AI, while pointing to efficiency improvements for real-time deployment.

Abstract

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video conditioned on the world state. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed methods. More videos can be found at https://umass-embodied-agi.github.io/COMBO/.
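The factorize-then-compose idea in the abstract can be illustrated with a toy sketch. It assumes the standard composable-diffusion rule, in which each per-agent conditional noise prediction contributes its offset from the unconditional prediction; the function name `compose_scores`, the `weight` parameter, and the flattened toy "frames" are hypothetical and are not taken from the paper's code.

```python
def compose_scores(eps_uncond, eps_cond_per_agent, weight=1.0):
    """Combine an unconditional noise prediction with per-agent conditional ones.

    Assumed rule: eps = eps_uncond + weight * sum_i (eps_i - eps_uncond).
    All inputs are flat lists of floats of equal length (toy stand-ins for frames).
    """
    composed = list(eps_uncond)
    for eps_i in eps_cond_per_agent:
        for k, (cond, uncond) in enumerate(zip(eps_i, eps_uncond)):
            # Each agent's action contributes its offset from the unconditional score.
            composed[k] += weight * (cond - uncond)
    return composed


# Toy example: two agents whose actions affect different "pixels".
eps_uncond = [0.0, 0.0, 0.0, 0.0]
eps_alice = [1.0, 0.0, 0.0, 0.0]  # hypothetical prediction given Alice's action
eps_bob = [0.0, 0.0, 0.0, 2.0]    # hypothetical prediction given Bob's action
composed = compose_scores(eps_uncond, [eps_alice, eps_bob])
```

Because the sum runs over however many per-agent predictions are supplied, the same function handles 2, 3, or 4 agents without change, which is the property the abstract relies on for an arbitrary number of agents.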

Paper Structure (47 sections, 8 equations, 13 figures, 8 tables, 1 algorithm)

This paper contains 47 sections, 8 equations, 13 figures, 8 tables, 1 algorithm.

Introduction
Related Work
Multi-Agent Planning
Large Generative Models for Embodied AI
Preliminaries
Problem Statement
Video Diffusion Models
Composable Diffusion Models
Compositional World Model
Composable Video Diffusion Models
Agent-Dependent Loss Scaling
Compositional World Model for Multi-Agent Planning
World State Estimation with Partial Egocentric Views
Planning Sub-modules with Vision Language Models
Planning Procedure with Tree Search
...and 32 more sections

Figures (13)

Figure 1: (a) Two challenging embodied multi-agent visual cooperation benchmarks, TDW-Cook and TDW-Game, where 2-4 agents cooperate to finish dishes according to the recipe or finish puzzles according to the visual clue. (b) The agent needs to infer other agents' intents, propose possible actions, and accurately simulate how the world may be affected by multiple sets of actions in order to cooperate efficiently in the long run.
Figure 2: Compositional World Model. Given the current world state $x_0$ and the joint action $a$ of multiple agents, the compositional world model predicts future states in three steps: it factorizes $a$ into components $a_i$, one per agent; it generates one score per component, conditioned on the current world state and that component's text; and it composes these scores to generate the video.
Figure 3: Method Overview. (a) Given partial egocentric RGBD observations, COMBO first reconstructs and inpaints the top-down orthographic image as the overall world state estimation. (b) COMBO then leverages the planning sub-modules built with Vision Language Models to propose actions, infer other agents' intents, and evaluate the outcomes simulated with the compositional world model. A tree search procedure combines these steps to plan online and cooperate in the long run.
Figure 4: Compositional World Model learns world dynamics better. Our compositional world model accurately simulates world dynamics conditioned on the joint action of multiple agents, while AVDC struggles to simulate which agents should act, and COMBO w/o ADLS may simulate actions incorrectly.
Figure 5: A larger computation budget leads to a better plan. With a larger computation budget (second row), COMBO can search for a better plan where Alice first clears the common region with David so that he can pass the next puzzle piece to her instead of having to wait, leading to a better state after the same number of steps.
...and 8 more figures
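The planning loop described in the Figure 3 and Figure 5 captions can be sketched as a depth-limited tree search. This is a minimal illustration under stated assumptions, not the paper's algorithm: the callables `propose`, `infer_others`, `simulate`, and `evaluate` are hypothetical stand-ins for the Action Proposer, Intent Tracker, compositional world model, and Outcome Evaluator, and the integer "world state" is a toy.

```python
def plan(state, depth, propose, infer_others, simulate, evaluate):
    """Return (best_score, best_first_action) from a depth-limited tree search."""
    if depth == 0:
        return evaluate(state), None
    best_score, best_action = float("-inf"), None
    for my_action in propose(state):
        # Joint action = own candidate action plus the inferred actions of the others.
        joint_action = [my_action] + infer_others(state)
        next_state = simulate(state, joint_action)
        score, _ = plan(next_state, depth - 1, propose, infer_others, simulate, evaluate)
        if score > best_score:
            best_score, best_action = score, my_action
    return best_score, best_action


# Toy stand-ins (all hypothetical): the state is an integer and the goal is 10.
propose = lambda s: [-1, 1, 2]           # candidate actions for the ego agent
infer_others = lambda s: [1]             # one partner assumed to always add 1
simulate = lambda s, joint: s + sum(joint)
evaluate = lambda s: -abs(s - 10)        # closer to the goal scores higher

score, action = plan(0, 2, propose, infer_others, simulate, evaluate)
```

Raising `depth` explores longer action sequences at a higher cost, which mirrors the trade-off in the Figure 5 caption between computation budget and plan quality.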
