LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams

Aju Ani Justus, Chris Baber

TL;DR

This work addresses the challenge of modeling heterogeneous-agent teams that include humans by using Large Language Models (LLMs) as policy-agnostic human proxies to generate synthetic data. Through three grid-world stag-hunt experiments, the authors show that larger LLMs can align with expert decisions under full observability, can be steered toward human-like risk preferences through prompt design, and can produce multi-step action trajectories that resemble human decision paths in dynamic multi-agent settings. The results support the viability of LLMs as scalable proxies for human decision-making in heterogeneous-agent reinforcement learning (HARL), enabling data-efficient evaluation and imitation-learning workflows. The authors acknowledge two limitations: generalization beyond a 5×5 grid is untested, and the proxies still need to be integrated into broader RL pipelines. Overall, the study provides a practical, model-agnostic approach to simulating policy-agnostic teammates and points toward multi-agent prompts and more complex environments as next steps.

Abstract

A critical challenge in modelling Heterogeneous-Agent Teams is training agents to collaborate with teammates whose policies are inaccessible or non-stationary, such as humans. Traditional approaches rely on expensive human-in-the-loop data, which limits scalability. We propose using Large Language Models (LLMs) as policy-agnostic human proxies to generate synthetic data that mimics human decision-making. To evaluate this, we conduct three experiments in a grid-world capture game inspired by Stag Hunt, a game theory paradigm that balances risk and reward. In Experiment 1, we compare decisions from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and Mixtral 8x22B models. LLMs, prompted with game-state observations and reward structures, align more closely with experts than participants, demonstrating consistency in applying underlying decision criteria. Experiment 2 modifies prompts to induce risk-sensitive strategies (e.g. "be risk averse"). LLM outputs mirror human participants' variability, shifting between risk-averse and risk-seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic grid-world where the LLM agents generate movement actions. LLMs produce trajectories resembling human participants' paths. While LLMs cannot yet fully replicate human adaptability, their prompt-guided diversity offers a scalable foundation for simulating policy-agnostic teammates.
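To make the prompting setup concrete, the following is a minimal sketch of how a game-state observation, a reward structure, and an optional risk instruction might be rendered into a text prompt. It is an illustration under stated assumptions, not the authors' prompt template: the names `GridState`, `RISK_HINTS`, and `build_prompt`, the prompt wording, and the reward values in the usage note are all hypothetical. Only the phrase "be risk averse" comes from the abstract.

```python
from dataclasses import dataclass
from typing import Tuple

Position = Tuple[int, int]  # (row, column) on the grid


@dataclass(frozen=True)
class GridState:
    """Hypothetical full-observability snapshot of the capture game."""
    size: int
    blue: Position      # hunter controlled by the human or the LLM proxy
    purple: Position    # the other hunter
    stag: Position
    hares: Tuple[Position, ...]


# Assumed steering phrases; only the risk-averse wording is quoted in the abstract.
RISK_HINTS = {
    "neutral": "",
    "averse": "Be risk averse.",
    "seeking": "Be risk seeking.",
}


def manhattan(a: Position, b: Position) -> int:
    """Grid distance between two cells when only axis-aligned moves are allowed."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_prompt(state: GridState, stag_reward: int, hare_reward: int,
                 risk: str = "neutral") -> str:
    """Render the state and rewards as plain text for an LLM to act on."""
    hares = ", ".join(str(h) for h in state.hares)
    lines = [
        f"You are the blue hunter on a {state.size}x{state.size} grid.",
        f"Blue hunter at {state.blue}; purple hunter at {state.purple}.",
        f"Stag at {state.stag}; hares at {hares}.",
        f"Your distance to the stag is {manhattan(state.blue, state.stag)}.",
        f"Capturing the stag needs both hunters and pays {stag_reward} each.",
        f"Capturing a hare alone pays {hare_reward}.",
    ]
    hint = RISK_HINTS[risk]
    if hint:
        lines.append(hint)
    lines.append("Answer with one target: STAG or HARE.")
    return "\n".join(lines)
```

For example, `build_prompt(GridState(5, (0, 0), (4, 4), (2, 2), ((0, 1), (4, 3))), stag_reward=10, hare_reward=2, risk="averse")` yields a prompt ending with the risk instruction and the answer format. Swapping the `risk` argument is one simple way to explore the kind of prompt-guided variation described for Experiment 2.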

Paper Structure

This paper contains 23 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example grid-world configuration showing the human agent (blue, B), machine agent (purple, P), stag (S), and hares (H).
  • Figure 2: Comparison of Confusion Matrices: Llama 3.1 70B, Mixtral 8x22B, and Human Participants from Baber2024.
  • Figure 3: Human and model risk behaviours ($\phi_\text{risk}$) across risk-seeking, neutral, and risk-averse ranges (-1 to 1), with positions reflecting varying decision-making tendencies.
  • Figure 4: Movement trajectories of Human (Blue), Llama 3.1 70B (Green), Mixtral 8x22B (Red), and Purple Hunter (Purple) in a 5x5 dynamic stag hunt environment. The Blue Hunter is controlled by either a human or an LLM agent, while the Purple Hunter follows a scripted path. The LLMs imitate human decision-making patterns to varying degrees, with Llama 3.1 showing strong alignment in risk-seeking behaviours.