Select to Perfect: Imitating desired behavior from large multi-agent data

Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Foerster, Joao Henriques

TL;DR

This work tackles the challenge of learning from heterogeneous multi-agent datasets while staying aligned with a user-defined desirability criterion, represented by a collective Desired Value Function (DVF). It introduces Exchange Values ($EV$s) as a measure of each agent's contribution to the DVF, and shows how $EV$s relate to Shapley Values while accommodating fixed-group-size constraints. The authors propose EV-Clustering to estimate $EV$s from limited or anonymized data, and EV2BC to train policies by selectively imitating high-contribution agents; EV2BC outperforms standard behavior cloning and offline RL baselines across Tragedy of the Commons (ToC), Overcooked, and StarCraft scenarios. The approach enables learning from mixed-behavior datasets by favoring agents that contribute to desirable outcomes, with practical implications for safe and effective multi-agent imitation in complex domains.

Abstract

AI agents are commonly trained with large datasets of demonstrations of human behavior. However, not all behaviors are equally safe or desirable. Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are not assigned to individual behaviors but to collective trajectories. For example, in a dataset of vehicle interactions, these scores might relate to the number of incidents that occurred. We first assess the effect of each individual agent's behavior on the collective desirability score, e.g., assessing how likely an agent is to cause incidents. This allows us to selectively imitate agents with a positive effect, e.g., only imitating agents that are unlikely to cause incidents. To enable this, we propose the concept of an agent's Exchange Value, which quantifies an individual agent's contribution to the collective desirability score. The Exchange Value is the expected change in desirability score when substituting the agent for a randomly selected agent. We propose additional methods for estimating Exchange Values from real-world datasets, enabling us to learn desired imitation policies that outperform relevant baselines. The project website can be found at https://tinyurl.com/select-to-perfect.
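
The Exchange Value described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes groups of a fixed size, a hypothetical additive DVF (`toy_dvf`), and made-up per-agent effects (`SAFETY`), and it enumerates every substitution exactly instead of estimating from data.

```python
from itertools import combinations
from statistics import mean

def exchange_values(agents, group_size, dvf):
    """Exact Exchange Values by enumeration (toy scale only).

    For each agent i, average dvf(C + {i}) - dvf(C + {j}) over every
    co-player set C of size group_size - 1 drawn from the other agents
    and every substitute j outside C and different from i.
    Requires len(agents) > group_size so that a substitute exists.
    """
    evs = {}
    for i in agents:
        others = [a for a in agents if a != i]
        diffs = []
        for co_players in combinations(others, group_size - 1):
            substitutes = [j for j in others if j not in co_players]
            for j in substitutes:
                with_i = frozenset(co_players) | {i}
                with_j = frozenset(co_players) | {j}
                diffs.append(dvf(with_i) - dvf(with_j))
        evs[i] = mean(diffs)
    return evs

# Hypothetical toy data: each driver's hidden effect on the group score.
SAFETY = {"ann": 3.0, "bob": 1.0, "cat": 0.0, "dan": -2.0}

def toy_dvf(group):
    # Assumed additive DVF; a real DVF would score a collective trajectory.
    return sum(SAFETY[a] for a in group)

evs = exchange_values(list(SAFETY), group_size=2, dvf=toy_dvf)
for agent, ev in evs.items():
    print(f"{agent}: {ev:+.3f}")
```

In this additive toy case, an agent's value is positive exactly when its effect exceeds the average effect of the agents that could replace it, and the values sum to zero across agents. Real datasets do not contain every substitution, which is why the paper proposes dedicated estimation methods.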

Paper Structure (47 sections, 3 theorems, 24 equations, 8 figures, 3 tables)

Key Result

Proposition A.1

For any characteristic function game $G=(N, v)$ and every agent $i\in N$ we have

Figures (8)

  • Figure 1: We are given a dataset composed of multi-agent trajectories generated by many individual agents, e.g., a dataset of cars driving in urban environments. In addition, the Desired Value Function (DVF) indicates the desirability score of a collective trajectory, e.g., the number of incidents that occurred. We first compute the Exchange Value (EV) of each agent, where a positive EV indicates that an agent increases the desirability score (e.g. an agent driving safely). We reformulate imitation learning to take into account the computed EVs, and achieve an imitation policy that is aligned with the DVF (e.g. only imitating the behavior of safe drivers).
  • Figure 1: Resulting performance with respect to the DVF for different imitation learning methods in different StarCraft scenarios.
  • Figure 2: Overview of different characteristics of real-world datasets, and whether Shapley Values and Exchange Values (EVs) are applicable to compute contributions of individual agents to the DVF.
  • Figure 3: Mean error in estimating EVs with decreasing number of observations. 'Deg.' refers to the fully anonymized degenerate case. Error decreases significantly if agents are clustered (green-shaded area).
  • Figure 4: In the Overcooked environments Cramped Room (left) and Coordination Ring (right), agents must cooperate to cook and deliver as many soups as possible within a given time.
  • ...and 3 more figures
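
The pipeline in the first Figure 1 caption (compute EVs, then imitate only agents that improve the DVF) can be sketched as a filtered form of behavior cloning. This is a minimal illustration under stated assumptions, not the paper's EV2BC: the demonstration format, the tabular majority-vote policy, the threshold of zero, the function name `ev_filtered_bc`, and the EV numbers are all hypothetical.

```python
from collections import Counter, defaultdict

def ev_filtered_bc(demos, evs, threshold=0.0):
    """Tabular behavior cloning restricted to agents with EV above threshold.

    demos: dict agent_id -> list of (state, action) pairs (assumed format)
    evs:   dict agent_id -> estimated Exchange Value
    """
    counts = defaultdict(Counter)
    for agent, pairs in demos.items():
        if evs.get(agent, float("-inf")) <= threshold:
            continue  # skip agents whose estimated contribution is too low
        for state, action in pairs:
            counts[state][action] += 1
    # Cloned policy: most frequent action of the selected agents per state.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical driving demonstrations; "dan" runs red lights and is overrepresented.
demos = {
    "ann": [("red_light", "stop"), ("green_light", "go")],
    "bob": [("red_light", "stop"), ("green_light", "go")],
    "dan": [("red_light", "go"), ("red_light", "go"),
            ("red_light", "go"), ("green_light", "go")],
}
evs = {"ann": 3.3, "bob": 0.7, "dan": -3.3}  # made-up values for illustration

plain_bc = ev_filtered_bc(demos, evs, threshold=float("-inf"))  # imitates everyone
selective = ev_filtered_bc(demos, evs, threshold=0.0)           # positive EV only
print(plain_bc["red_light"], selective["red_light"])
```

Setting the threshold to negative infinity recovers ordinary behavior cloning over the whole dataset, so the two calls differ only in which agents are imitated.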

Theorems & Definitions (14)

  • Definition 3.1: Characteristic function game
  • Definition 3.2: Shapley Value
  • Definition 4.1: Exchange Value
  • Definition 4.2: Constrained characteristic function game
  • Definition 4.3: Constrained Exchange Value
  • Definition 4.4: EV-Clustering
  • Definition 4.5: EV based Behavior Cloning (EV2BC)
  • Proposition A.1
  • proof
  • Example A.2
  • ...and 4 more
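
For comparison with the Exchange Value, the standard Shapley Value of Definition 3.2 can be computed exactly on a small characteristic function game. The sketch below uses the textbook permutation form; the three-agent game `v` and the bookkeeping set `queried_sizes` are hypothetical and serve only to show that the computation evaluates coalitions of every size, which a dataset with a fixed group size would not provide.

```python
from itertools import permutations

def shapley_values(agents, v):
    """Exact Shapley Values: average marginal contribution over all orderings."""
    totals = {a: 0.0 for a in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for a in order:
            totals[a] += v(coalition | {a}) - v(coalition)
            coalition = coalition | {a}
    return {a: t / len(orderings) for a, t in totals.items()}

queried_sizes = set()  # records which coalition sizes the computation needs

def v(coalition):
    # Hypothetical game: "a" and "b" earn 10 only together; "c" always adds 2.
    queried_sizes.add(len(coalition))
    value = 10.0 if {"a", "b"} <= coalition else 0.0
    return value + (2.0 if "c" in coalition else 0.0)

agents = ["a", "b", "c"]
sv = shapley_values(agents, v)
print(sv)
print(sorted(queried_sizes))
```

The values satisfy the usual efficiency property (they sum to the value of the full group), but obtaining them requires scoring the empty coalition, single agents, pairs, and the full group alike.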