
Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

Siddharth Srikanth, Varun Bhatt, Boshen Zhang, Werner Hager, Charles Michael Lewis, Katia P. Sycara, Aaquib Tabrez, Stefanos Nikolaidis

TL;DR

Modeling diverse human-like teaming and communication in multi-agent settings is hindered by the cost of collecting large-scale human data. PLAN-QD combines Quality Diversity (QD) optimization with LLM-powered agents to algorithmically generate prompts that elicit diverse, human-like coordination in a long-horizon environment. It optimizes the discounted return $J=\sum_t \gamma^t r_t$ (discount factor $\gamma$, reward $r_t$ at step $t$) while shaping diversity across a measure space $\mathbf{m}$. The work presents a human-subjects study of teaming in the Steakhouse environment and introduces PLAN-QD as a prompt-generation framework. It also offers empirical evidence that PLAN-QD reproduces communication effects observed in human teams and expands the range of behaviors beyond traditional prompting. The approach yields broader behavioral coverage than baselines, enabling scalable study and design of human-AI teams in complex tasks and informing safer, more effective human-AI collaboration.
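To make the objective concrete, the sketch below computes $J=\sum_t \gamma^t r_t$ for a single episode. It assumes that `rewards[t]` is the team reward at step $t$ and that $0 \le \gamma \le 1$. How PLAN-QD actually defines $r_t$ in Steakhouse is not stated here, and the function name `discounted_return` is our own.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute J = sum_t gamma**t * r_t for one episode.

    Assumption: rewards[t] is the (hypothetical) team reward at step t,
    starting from t = 0, and gamma lies in [0, 1].
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last step backwards: J_t = r_t + gamma * J_{t+1},
    # which equals the forward sum but avoids computing gamma**t explicitly.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # A dish delivered at step 2 worth 20 points, discounted with gamma = 0.9.
    print(discounted_return([0.0, 0.0, 20.0], 0.9))
```

For the usage line, $J = 0.9^2 \cdot 20 = 16.2$ (up to floating-point rounding). The backward accumulation is simply Horner's scheme applied to the same sum, so it gives the same value as summing $\gamma^t r_t$ forward.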

Abstract

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. Yet obtaining a large set of diverse behaviors requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to generate diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment ($n=54$ participants), that humans exhibit diverse coordination and communication behavior in this domain. We then show that our approach can effectively replicate trends from human teaming data and also capture behaviors that are not easily observed without collecting large amounts of data. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

Paper Structure

This paper contains 35 sections, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: PLAN-QD uses Quality Diversity (QD) optimization to generate a set of prompts to elicit human-like teaming diversity in LLM-powered agents. The resulting teams exhibit distinct collaboration strategies (e.g., meat specialist with onion specialist), enabling the systematic study of communication and coordination in complex environments.
  • Figure 2: Overview of the PLAN-QD framework, including the QD optimization (green arrows) and the LLM-powered agents. QD optimization repeatedly selects and mutates prompts to generate new prompts that are then evaluated in the environment (purple arrows). Only high-quality and diverse prompts are retained in the prompt archive.
  • Figure 3: Kitchen layouts in the Steakhouse environment. The four layouts span a spectrum from symmetrical (Open), where both players have equal access to all stations, to asymmetrical (Forced Coordination), where players depend on each other to access ingredients or complete tasks. Asymmetric layouts require more inter-agent coordination.
  • Figure 4: Example heatmaps of the archives resulting from human data and the agents generated by PLAN-QD and Random Mutation, colored by the corresponding fitness value. PLAN-QD generates agents covering a wider range of behavior compared to the Random Mutation baseline, including certain extremes observed in the human data. Videos of example behaviors are included in the supplementary material.
  • Figure 5: Overview of the Steakhouse domain with an example environment on the left and the recipes on the right.
  • ...and 1 more figure
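Figure 2 describes the PLAN-QD loop only at a high level: select a prompt, mutate it, evaluate it in the environment, and retain it only if it is high-quality and diverse. The sketch below shows one common way to realize such a loop, a MAP-Elites-style grid archive. The grid archive, uniform parent selection, and every name in the code (`qd_prompt_search`, `try_insert`, `coverage`, `toy_mutate`, `toy_evaluate`) are our assumptions, not the paper's implementation. In PLAN-QD, mutation is performed with an LLM and evaluation runs LLM-powered agents in Steakhouse. Here both steps are replaced by hypothetical toy string functions so the example runs offline.

```python
import random
from typing import Callable, Dict, List, Tuple

Measures = Tuple[float, ...]
Cell = Tuple[int, ...]
Elite = Tuple[float, Measures, str]  # (fitness, measures, prompt)


def cell_index(measures: Measures, bins: int) -> Cell:
    """Map measures assumed to lie in [0, 1] onto a grid cell (clamped to the grid)."""
    return tuple(max(0, min(int(m * bins), bins - 1)) for m in measures)


def try_insert(archive: Dict[Cell, Elite], prompt: str, fitness: float,
               measures: Measures, bins: int) -> bool:
    """Retain the prompt only if its cell is empty or it beats the current elite."""
    cell = cell_index(measures, bins)
    incumbent = archive.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        archive[cell] = (fitness, measures, prompt)
        return True
    return False


def qd_prompt_search(seed_prompts: List[str],
                     mutate: Callable[[str, random.Random], str],
                     evaluate: Callable[[str], Tuple[float, Measures]],
                     iterations: int, bins: int = 10, seed: int = 0) -> Dict[Cell, Elite]:
    rng = random.Random(seed)
    archive: Dict[Cell, Elite] = {}
    for prompt in seed_prompts:
        fitness, measures = evaluate(prompt)
        try_insert(archive, prompt, fitness, measures, bins)
    if not archive:
        raise ValueError("need at least one seed prompt")
    for _ in range(iterations):
        # Select: sample a parent prompt uniformly from the current elites.
        parent = rng.choice(list(archive.values()))[2]
        # Mutate: stand-in for the LLM-based prompt rewrite.
        child = mutate(parent, rng)
        # Evaluate: stand-in for running the agent team and reading off fitness/measures.
        fitness, measures = evaluate(child)
        try_insert(archive, child, fitness, measures, bins)
    return archive


def coverage(archive: Dict[Cell, Elite], bins: int, dims: int) -> float:
    """Fraction of grid cells holding an elite."""
    return len(archive) / bins ** dims


# --- Hypothetical toy stand-ins (not from the paper) ---
TRAITS = ["talk often", "stay quiet", "handle meat", "handle onions", "help partner"]


def toy_mutate(prompt: str, rng: random.Random) -> str:
    # Append one randomly chosen behavioral instruction to the prompt.
    return prompt + "; " + rng.choice(TRAITS)


def toy_evaluate(prompt: str) -> Tuple[float, Measures]:
    # Fitness rewards "help partner" instructions; the two measures are the
    # shares of "talk often" and "handle meat" instructions in the prompt.
    segments = prompt.count(";") + 1
    talk = prompt.count("talk often") / segments
    meat = prompt.count("handle meat") / segments
    return float(prompt.count("help partner")), (talk, meat)


if __name__ == "__main__":
    archive = qd_prompt_search(["You are a line cook."], toy_mutate, toy_evaluate,
                               iterations=300, bins=5)
    print(f"filled cells: {len(archive)}, coverage: {coverage(archive, 5, 2):.2f}")
    for cell, (fit, _, prompt) in sorted(archive.items())[:3]:
        print(cell, fit, prompt[:60])
```

Each filled cell plays the role of one square in an archive heatmap such as those in Figure 4, colored by the elite's fitness. The fraction of filled cells is one simple proxy for how wide a range of behavior a method reaches. The acceptance rule in `try_insert` is what "only high-quality and diverse prompts are retained" amounts to in a grid archive: diversity comes from occupying distinct cells, and quality comes from keeping the best prompt per cell.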