Scaling Multi-Agent Environment Co-Design with Diffusion Models

Hao Xiang Li, Michael Amir, Amanda Prorok

TL;DR

This paper tackles the challenge of scaling agent-environment co-design by formulating co-design as an underspecified multi-agent problem and introducing a diffusion-based framework, Diffusion Co-Design (DiCoDe). It combines two components: Projected Universal Guidance (PUG), which generates reward-maximising environments under hard constraints, and critic distillation, which passes a dense, up-to-date learning signal from the agent critic to the environment critic so that environment generation adapts quickly as policies evolve. Empirically, DiCoDe outperforms the state of the art on warehouse, wind farm, and multi-agent navigation benchmarks, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. The work demonstrates improved sample efficiency and scalability for co-design, a step towards deploying co-designed agent-environment pairs in real-world domains.

Abstract

The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to wind farm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework pushing co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state of the art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real-world domains.
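
To make the idea of guided sampling under a hard constraint more concrete, here is a minimal toy sketch. It is not the authors' PUG implementation and uses no trained diffusion model: the critic is a hand-written quadratic, the denoising loop is replaced by a simple noisy gradient-ascent loop, and all names (`critic_grad`, `project`, `sample_environment`, `MIN_SEP`, `GOAL`) are hypothetical. It only illustrates the pattern the abstract describes, in which each sampling step follows the critic's gradient and is then projected back onto the set of environments that satisfy a minimum-separation constraint between obstacles.

```python
import math
import random

MIN_SEP = 0.3      # hypothetical hard constraint: minimum distance between two obstacles
GOAL = (0.5, 0.5)  # hypothetical point the toy critic rewards obstacles for approaching


def critic_grad(points):
    """Gradient of a toy value V = -sum ||p - GOAL||^2, standing in for a learned critic."""
    return [(-2.0 * (x - GOAL[0]), -2.0 * (y - GOAL[1])) for x, y in points]


def project(points, min_sep=MIN_SEP, sweeps=100):
    """Push apart pairs of points that are closer than min_sep.

    For two points this is the exact Euclidean projection onto the feasible set;
    for more points the repeated pairwise sweep is only a heuristic repair.
    """
    pts = [list(p) for p in points]
    for _ in range(sweeps):
        moved = False
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dx, dy = pts[j][0] - pts[i][0], pts[j][1] - pts[i][1]
                dist = math.hypot(dx, dy)
                if dist >= min_sep - 1e-12:
                    continue
                # Coincident points have no direction; pick the x-axis arbitrarily.
                ux, uy = (dx / dist, dy / dist) if dist > 1e-12 else (1.0, 0.0)
                shift = 0.5 * (min_sep - dist)  # each point moves half the deficit
                pts[i][0] -= shift * ux
                pts[i][1] -= shift * uy
                pts[j][0] += shift * ux
                pts[j][1] += shift * uy
                moved = True
        if not moved:
            break
    return [tuple(p) for p in pts]


def sample_environment(num_obstacles=2, steps=50, guidance=0.1, seed=0):
    """Toy guided sampler: noisy gradient ascent on the critic, projected every step."""
    rng = random.Random(seed)
    pts = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(num_obstacles)]
    for t in range(steps, 0, -1):
        noise = 0.1 * t / steps  # injected noise shrinks as sampling proceeds
        grads = critic_grad(pts)
        pts = [
            (x + guidance * gx + noise * rng.gauss(0.0, 1.0),
             y + guidance * gy + noise * rng.gauss(0.0, 1.0))
            for (x, y), (gx, gy) in zip(pts, grads)
        ]
        pts = project(pts)  # enforce the hard constraint after every step
    return pts


if __name__ == "__main__":
    env = sample_environment()
    print(env, math.dist(env[0], env[1]))
```

In this toy, the critic alone would pull both obstacles onto the same point, so the projection is what keeps the sample feasible. A real system would replace `critic_grad` with the gradient of a learned environment critic and the noisy update with the denoising step of a diffusion model.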

Paper Structure

This paper contains 26 sections, 20 equations, 11 figures, 4 tables, 2 algorithms.

Figures (11)

  • Figure 1: General framework of our diffusion co-design method. Extending a MARL iteration, we introduce an environment critic trained using critic distillation. This critic guides a diffusion model via a carefully designed sampling process that satisfies hard constraints, generating a distribution of highly rewarding environments on which trajectories are collected. Repeating this process leads to consistently superior policy-environment tuples.
  • Figure 2: Left) Corner scenario training curves, with an example of a randomly sampled environment and a DiCoDe-generated environment after training. We report the smoothed mean episode return, with $95\%$ confidence intervals shaded. Episode reward corresponds to boxes delivered. Right) Heatmap of shelf placement by DiCoDe across $100$ environments. Starting from random environments, DiCoDe learns to place shelves near goals of the same colour while keeping navigation channels free.
  • Figure 3: Rendering of environments before and after training.
  • Figure 3: Corner. Left) For each method, we sample $32$ environments with guidance from the same critic, and report the value estimated by that critic. Right) Probes of environment critic training. We compare the min and max of $y$, the learning objective of the environment critic, within each batch generated by DiCoDe (critic distillation) and DiCoDe-MC (sampled trajectory returns). Both are estimates of the true discounted return of an environment. We also report the environment critic learning loss.
  • Figure 4: Results on continuous environment design spaces. Left) Performance of co-design methods relative to domain randomisation against the number of turbines in WFCRL. Right) Examples of generated environments after training, with ONav and WFCRL4.
  • ...and 6 more figures