Select to Perfect: Imitating desired behavior from large multi-agent data

Tim Franzmeyer; Edith Elkind; Philip Torr; Jakob Foerster; Joao Henriques

TL;DR

This work addresses learning from heterogeneous multi-agent datasets while staying aligned with a user-defined desirability criterion, expressed as a Desired Value Function (DVF) that scores collective trajectories rather than individual behaviors. It introduces Exchange Values ($EV$s), which measure each agent's contribution to the DVF, and shows how $EV$s relate to Shapley Values while remaining applicable when only certain group sizes are permitted. The authors propose EV-Clustering to estimate $EV$s from limited or anonymized data, and EV2BC to train policies by imitating only high-contribution agents. EV2BC outperforms standard behavior cloning and offline RL baselines in Tragedy of the Commons (ToC), Overcooked, and StarCraft scenarios. The approach makes mixed-behavior datasets usable for imitation by favoring agents that contribute to desirable outcomes, which matters for safe and effective multi-agent imitation in complex domains.
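The selective-imitation idea behind EV2BC can be illustrated with a minimal sketch. It assumes Exchange Values have already been computed per agent and uses a tabular majority-vote policy as a stand-in for behavior cloning. The function names, the `threshold` parameter, and the toy driving data are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

def select_demonstrations(demos, exchange_values, threshold=0.0):
    """Keep only (observation, action) pairs from agents whose EV exceeds the threshold.

    `demos` is a list of (agent_id, observation, action) tuples (hypothetical format).
    """
    return [(obs, act) for agent_id, obs, act in demos
            if exchange_values[agent_id] > threshold]

def fit_tabular_bc(pairs):
    """Stand-in for behavior cloning: pick the most frequent action per observation."""
    counts = defaultdict(Counter)
    for obs, act in pairs:
        counts[obs][act] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}

# Toy data: agent "a" drives safely, agent "b" does not (illustrative values only).
demos = [
    ("a", "s0", "brake"),
    ("b", "s0", "accelerate"),
    ("b", "s0", "accelerate"),
    ("a", "s1", "yield"),
]
exchange_values = {"a": 0.5, "b": -0.3}

# Plain behavior cloning imitates every agent in the dataset.
plain_policy = fit_tabular_bc([(obs, act) for _, obs, act in demos])
# Selective imitation uses only agents with a positive Exchange Value.
selective_policy = fit_tabular_bc(select_demonstrations(demos, exchange_values))
```

In this toy dataset, plain cloning copies the majority action in `s0`, which comes from the undesirable agent, while the filtered variant imitates only agent `a`.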

Abstract

AI agents are commonly trained with large datasets of demonstrations of human behavior. However, not all behaviors are equally safe or desirable. Desired characteristics for an AI agent can be expressed by assigning desirability scores, which we assume are not assigned to individual behaviors but to collective trajectories. For example, in a dataset of vehicle interactions, these scores might relate to the number of incidents that occurred. We first assess the effect of each individual agent's behavior on the collective desirability score, e.g., assessing how likely an agent is to cause incidents. This allows us to selectively imitate agents with a positive effect, e.g., only imitating agents that are unlikely to cause incidents. To enable this, we propose the concept of an agent's Exchange Value, which quantifies an individual agent's contribution to the collective desirability score. The Exchange Value is the expected change in desirability score when substituting the agent for a randomly selected agent. We propose additional methods for estimating Exchange Values from real-world datasets, enabling us to learn desired imitation policies that outperform relevant baselines. The project website can be found at https://tinyurl.com/select-to-perfect.
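The verbal definition of the Exchange Value above (the expected change in desirability score when an agent is substituted for a randomly selected agent) can be approximated by sampling. The sketch below is one possible Monte Carlo reading of that sentence, not the paper's estimator: the uniform sampling scheme, the fixed `group_size`, the function name, and the additive toy DVF are all assumptions made for illustration.

```python
import random

def estimate_exchange_value(agent, agents, group_size, dvf, num_samples=2000, seed=0):
    """Monte Carlo sketch: average DVF difference between a group containing `agent`
    and the same group with `agent` swapped for a uniformly sampled other agent."""
    rng = random.Random(seed)
    others = [a for a in agents if a != agent]
    total = 0.0
    for _ in range(num_samples):
        # Draw group_size - 1 teammates plus one replacement, all distinct.
        sampled = rng.sample(others, group_size)
        teammates, replacement = sampled[:-1], sampled[-1]
        with_agent = frozenset(teammates + [agent])
        with_replacement = frozenset(teammates + [replacement])
        total += dvf(with_agent) - dvf(with_replacement)
    return total / num_samples

# Toy additive DVF (assumption): a group's score is the sum of hidden per-agent skills.
skill = {0: 1.0, 1: 0.0, 2: 0.0, 3: -1.0}

def toy_dvf(group):
    return sum(skill[a] for a in group)

agents = list(skill)
ev_estimates = {a: estimate_exchange_value(a, agents, group_size=2, dvf=toy_dvf)
                for a in agents}
```

For this additive toy DVF, the quantity being estimated equals the agent's own skill minus the mean skill of the other agents, so agent `0` receives a positive estimate and agent `3` a negative one.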

Paper Structure (47 sections, 3 theorems, 24 equations, 8 figures, 3 tables)

Introduction
Related Work
Background and Notation
Markov Game.
Set of multi-agent demonstrations generated by many agents.
Shapley Values.
Methods
Problem setting.
Overview of the methods section.
Exchange Values to evaluate agents' individual contributions
Relationship between Shapley Value and Exchange Value.
Computing Exchange Values if only certain group sizes are permitted
Estimating Exchange Values from limited data
EV-Clustering identifies similar agents
Degeneracy of the credit assignment problem for fully-anonymized data
...and 32 more sections

Key Result

Proposition A.1

For any characteristic function game $G=(N, v)$ and every agent $i\in N$ we have

Figures (8)

Figure 1: We are given a dataset composed of multi-agent trajectories generated by many individual agents, e.g., a dataset of cars driving in urban environments. In addition, the Desired Value Function (DVF) indicates the desirability score of a collective trajectory, e.g., the number of incidents that occurred. We first compute the Exchange Value (EV) of each agent, where a positive EV indicates that an agent increases the desirability score (e.g. an agent driving safely). We reformulate imitation learning to take into account the computed EVs, and achieve an imitation policy that is aligned with the DVF (e.g. only imitating the behavior of safe drivers).
Figure 1: Resulting performance with respect to the DVF for different imitation learning methods in different StarCraft scenarios.
Figure 2: Overview of different characteristics of real-world datasets, and whether Shapley Values and Exchange Values (EVs) are applicable to compute contributions of individual agents to the DVF.
Figure 3: Mean error in estimating EVs with decreasing number of observations. 'Deg.' refers to the fully anonymized degenerate case. Error decreases significantly if agents are clustered (green-shaded area).
Figure 4: In the Overcooked environments Cramped Room (left) and Coordination Ring (right), agents must cooperate to cook and deliver as many soups as possible within a given time.
...and 3 more figures

Theorems & Definitions (14)

Definition 3.1: Characteristic function game
Definition 3.2: Shapley Value
Definition 4.1: Exchange Value
Definition 4.2: Constrained characteristic function game
Definition 4.3: Constrained Exchange Value
Definition 4.4: EV-Clustering
Definition 4.5: EV based Behavior Cloning (EV2BC)
Proposition A.1
Proof
Example A.2
...and 4 more
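
For reference, the Shapley Value named in Definition 3.2 is usually written as follows for a characteristic function game $G=(N, v)$ with $n = |N|$ agents. This is the standard textbook form; the paper's own notation may differ.

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}
  \bigl( v(S \cup \{i\}) - v(S) \bigr)
```

Each term weights agent $i$'s marginal contribution to a coalition $S$ by the fraction of agent orderings in which exactly the members of $S$ precede $i$.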