LLMs Can't Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

Maojia Song, Tej Deep Pala, Ruiwen Zhou, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria

TL;DR

Kairos introduces a socially grounded benchmark to study LLM behavior in multi-agent systems, focusing on rapport, current peer behavior, and self-confidence. The framework enables dynamic data construction and tailored social scenarios to measure accuracy, robustness, utility, and resistance under peer influence. Across prompting, SFT, and GRPO, results show model scale moderates susceptibility to social cues, with GRPO plus MAS context delivering the strongest accuracy gains while maintaining robustness for larger models; smaller models remain vulnerable. A key finding is that MCQ formats can mask conformity effects relative to open-ended tasks, underscoring a persistent challenge: improving accuracy must go hand-in-hand with robust, deception-resistant social reasoning for reliable multi-agent collaboration.

Abstract

Large language models (LLMs) are increasingly integrated into multi-agent systems (MAS), where peer interactions shape individual decisions. While prior work has mainly examined conformity bias, we broaden the view to include how LLMs build rapport from prior interactions, discern and integrate high-quality peer information, and resist misleading inputs, all of which are abilities essential for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark that simulates quiz-style collaboration with peer agents whose rapport levels and behaviours can be precisely controlled in both historical interactions and the current round. This unified setup enables systematic analysis of how rapport, peer actions, and the model's self-confidence jointly influence decision-making. Using KAIROS, we evaluate prompting, supervised fine-tuning, and reinforcement learning via Group Relative Policy Optimisation (GRPO). Results show that model scale is a primary factor moderating susceptibility to social influence: larger models are more resilient and benefit from prompting-based mitigation, whereas smaller models remain vulnerable. Only carefully configured GRPO training yields consistent robustness and performance gains for small models.

Paper Structure

This paper contains 54 sections, 10 equations, 9 figures, 11 tables, 1 algorithm.

Figures (9)

  • Figure 1: Left: Training dataset (N=10,000). Right: Test dataset (N=3,000). The inner ring groups tasks by category — Training: Reasoning 37.5%, Knowledge 21.1%, Social 20.8%, Creativity 20.5%; Test: Reasoning 33.3%, Knowledge 22.3%, Social 22.2%, Creativity 22.2%. The outer ring breaks each category into individual datasets; wedge labels give original instance counts.
  • Figure 2: Overview of the Kairos evaluation framework. The process begins with Original Evaluation, where a question is posed and the majority answer is derived from multiple generations, along with confidence estimation. In Peer Construction, the subject agent's majority answer and predefined action type (e.g., support) are used to construct interactions with other agents. Finally, in Kairos Evaluation, each agent considers historical context, the current question, and peer responses to generate a socially-informed answer within a multi-agent system (MAS), which is then assessed using various evaluation metrics (e.g., accuracy & robustness, utility, and resistance).
  • Figure 3: Comparison of the loss of correct predictions ($p_c(1 - R_M)$) against the gains from correcting errors ($p_i U_M$). Each pair of bars corresponds to a different model variant under the MAS-NS-OR setting.
  • Figure 4: Average “Resistance” and “Utility” proportions across model configurations, with bar hatching distinguishing the two metrics and colour intensity encoding each configuration’s mean confidence. Models are grouped by family (Qwen 2.5-3B, Qwen 2.5-7B, Qwen 2.5-14B, Llama 3.2-3B, and Llama 3-8B), and each family includes the original Base, SFT, and GRPO variants. Vertical dashed lines demarcate each model family.
  • Figure 5: Transition analysis of Qwen2.5-3B models under three training settings: Instruct (top), SFT (middle), and GRPO (bottom). Each figure visualises transitions between historical correctness and current model prediction outcomes across varying dialogue rapport levels (0, 25, 50, 75, 100; termed “Trust Level” in the figures) and other-agent actions (SUPPORT, OPPOSEEASY, OPPOSEHARD). Each quadrant in a plot corresponds to: Top-left: Correct$\rightarrow$Correct, Top-right: Correct$\rightarrow$Wrong, Bottom-left: Wrong$\rightarrow$Correct, Bottom-right: Wrong$\rightarrow$Wrong. Bubble size represents the transition frequency (proportion), and colour intensity indicates average model confidence.
  • ...and 4 more figures
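
The two quantities compared in the Figure 3 caption can be worked through numerically. The sketch below is illustrative only and rests on assumptions not stated in the captions: that $p_c$ is the fraction of questions the model answers correctly without peers, that $p_i = 1 - p_c$, that $R_M$ (resistance) is the fraction of originally correct answers the model keeps under peer influence, and that $U_M$ (utility) is the fraction of originally wrong answers it corrects. The function name and the input values are hypothetical and are not taken from the paper.

```python
def loss_and_gain(p_c: float, resistance: float, utility: float):
    """Return (loss, gain, net) for the two terms in the Figure 3 caption.

    Assumed meanings (not confirmed by the source text):
      p_c        fraction of answers correct before peer interaction
      resistance R_M, fraction of correct answers kept under peer influence
      utility    U_M, fraction of wrong answers corrected under peer influence
    """
    for name, value in (("p_c", p_c), ("resistance", resistance), ("utility", utility)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    p_i = 1.0 - p_c                    # assumed: fraction initially incorrect
    loss = p_c * (1.0 - resistance)    # correct answers lost: p_c (1 - R_M)
    gain = p_i * utility               # errors corrected:     p_i U_M
    return loss, gain, gain - loss     # net > 0 means peers helped overall


if __name__ == "__main__":
    # Made-up inputs chosen only to show the arithmetic.
    loss, gain, net = loss_and_gain(p_c=0.6, resistance=0.9, utility=0.25)
    print(f"loss={loss:.3f} gain={gain:.3f} net={net:+.3f}")
```

Under these assumptions, a model benefits from the multi-agent setting only when the gain term exceeds the loss term, which is the comparison each pair of bars in Figure 3 makes visible.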