Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín

TL;DR

This work investigates how Big Five personality traits, steered via representation engineering, influence cooperation in multi-agent LLM settings, using the Iterated Prisoner’s Dilemma (IPD) as a controlled testbed. By applying a steering factor to trait representations, the study systematically evaluates cooperation, exploitation, honesty, and group outcomes across three IPD setups, including agent-agent interactions. Key findings show that Agreeableness and Conscientiousness promote cooperation but also increase vulnerability to exploitation, while honesty improves under these steered traits and benefits group performance in communication-enabled scenarios. The results highlight both the potential and limitations of personality-based steering for aligning autonomous AI agents, with implications for safer, more cooperative multi-agent systems, and call for broader validation across tasks, models, and payoff structures.

Abstract

As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.
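
To make the idea of a steering factor concrete, here is a minimal sketch of how a trait direction could be added to a hidden activation. It is an illustration under stated assumptions, not the authors' implementation: the function `steer`, the toy 4-dimensional vectors, and the unit-normalisation step are all hypothetical, and only the factor of 3.5 comes from the figure captions.

```python
import numpy as np

def steer(hidden, trait_vector, factor):
    """Shift a hidden state along a trait direction (hypothetical helper).

    The trait vector is normalised to unit length (an assumption), so
    `factor` is the size of the shift along that direction.
    """
    direction = trait_vector / np.linalg.norm(trait_vector)
    return hidden + factor * direction

# Toy values for illustration only; real activations are model-sized.
hidden = np.array([0.5, -1.0, 2.0, 0.0])
agreeableness = np.array([0.0, 3.0, 4.0, 0.0])  # hypothetical trait vector

plus = steer(hidden, agreeableness, 3.5)    # steer towards the trait
minus = steer(hidden, agreeableness, -3.5)  # steer away from the trait
```

A positive factor pushes the activation towards the trait and a negative factor pushes it away, which corresponds to the plus/minus conditions reported in the figures.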

Paper Structure

This paper contains 31 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Results of preliminary experiments showing cooperation rates of 3 open-source models in the Iterated Prisoner’s Dilemma.
  • Figure 2: a) Troublemaking rate, b) Exploitability rate, c) Forgiveness rate, d) Retaliatory rate of Player A for the baseline (un-steered) and for each of the Big Five personality traits steered in each direction at a factor of 3.5 for each personality vector.
  • Figure 3: Lying rate of Player A for the baseline (un-steered) and for each of the Big Five personality traits steered in each direction at a factor of 3.5 for each personality vector. a) Player B - altruistic, b) Player B - selfish.
  • Figure 4: (a) Heatmap of total score: total number of years spent in prison by both prisoners. (b) Heatmap of personal score: difference in prison time of Player A as compared to Player B. A+/-: Agreeableness plus/minus, C+/-: Conscientiousness plus/minus, E+/-: Extraversion plus/minus, N+/-: Neuroticism plus/minus, O+/-: Openness plus/minus.
  • Figure 5: System prompts for each experiment
  • ...and 3 more figures