ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance

Andrey Risukhin, Kavel Rao, Ben Caffee, Alan Fan

TL;DR

ColorGrid is a non-stationary, asymmetric MARL benchmark for studying real-time human goal inference and cooperative assistance. Using IPPO as a baseline, the paper shows that standard learning approaches struggle when a follower must infer a leader’s changing goal without explicit communication, even under symmetric information. Architectural and training strategies, including an auxiliary supervised loss and reward shaping, provide only partial benefits. The authors highlight the roles of exploration cost, penalty annealing, and balanced learning, and show that warmstarting and supervised objectives can stabilize or improve learning in certain regimes. The work releases environment code, model checkpoints, and trajectory visualizations to support the development of algorithms capable of robust goal inference and assistance in real-world human–AI collaboration scenarios.

Abstract

Autonomous agents' interactions with humans are increasingly focused on adapting to their changing preferences in order to improve assistance in real-world tasks. Effective agents must learn to accurately infer human goals, which are often hidden, to collaborate well. However, existing Multi-Agent Reinforcement Learning (MARL) environments lack the attributes required to rigorously evaluate these agents' learning capabilities. To this end, we introduce ColorGrid, a novel MARL environment with customizable non-stationarity, asymmetry, and reward structure. We investigate the performance of Independent Proximal Policy Optimization (IPPO), a state-of-the-art (SOTA) MARL algorithm, in ColorGrid and find through extensive ablations that, particularly with simultaneous non-stationary and asymmetric goals between a "leader" agent representing a human and a "follower" assistant agent, ColorGrid is unsolved by IPPO. To support benchmarking future MARL algorithms, we release our environment code, model checkpoints, and trajectory visualizations at https://github.com/andreyrisukhin/ColorGrid.

Paper Structure

This paper contains 41 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: ColorGrid visualization demonstrating how assistant agent (follower) learns after observing human (leader) actions.
  • Figure 2: Agent learning in a symmetric vs. asymmetric scenario.
  • Figure 3: Using a frozen expert leader trained with IPPO, we train a cold-started follower while varying the training reward structure so that random block collection has positive, neutral, or negative expected value (EV). Dotted lines show baselines: the expert leader, an A* search leader, and an A* search follower that routes to the last color picked up by the A* leader. We use seed 0 for these comparisons, except for the A* agent scores, which are averaged across 100 seeds.
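
The expected-value framing in the Figure 3 caption can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes that exactly one of n_colors block colors is the goal, that a randomly collecting agent picks each color with equal probability, and that the goal reward and penalty values shown are hypothetical placeholders rather than ColorGrid's actual settings. The function name random_collection_ev is likewise invented for illustration.

```python
from fractions import Fraction


def random_collection_ev(goal_reward, penalty, n_colors):
    """Expected reward of collecting one block of a uniformly random color.

    Assumes exactly one of `n_colors` colors is the goal color; all values
    here are illustrative and not ColorGrid's real configuration.
    """
    p_goal = Fraction(1, n_colors)
    return p_goal * Fraction(goal_reward) + (1 - p_goal) * Fraction(penalty)


# Hypothetical penalties chosen so the EV is positive, zero, and negative
# for a goal reward of 2 and three block colors.
for label, penalty in [
    ("positive", Fraction(-1, 2)),
    ("neutral", Fraction(-1)),
    ("negative", Fraction(-2)),
]:
    ev = random_collection_ev(goal_reward=2, penalty=penalty, n_colors=3)
    print(f"{label} EV regime: penalty={penalty}, EV={ev}")
```

Under these assumptions, the sign of the EV depends only on how the penalty compares with the goal reward scaled by the odds of picking the goal color, which is one way to read the three reward regimes the caption compares.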