Learning to Lie: Reinforcement Learning Attacks Damage Human-AI Teams and Teams of LLMs

Abed Kareem Musaffar, Anand Gokhale, Sirui Zeng, Rasta Tadayon, Xifeng Yan, Ambuj Singh, Francesco Bullo

TL;DR

This work investigates how an adversarial AI can undermine small human-AI teams in safety-critical decision-making by learning trust dynamics and manipulating influence via Model-Based Reinforcement Learning (MBRL). It introduces a cognitive Beta-trust model and a data-driven ML predictor to model influence evolution, and it evaluates an MBRL attacker that operates against humans in a 25-round trivia game. The study finds that data-driven attacks can significantly degrade team performance. It also benchmarks LLMs as decision-makers, finding that LLMs can emulate human influence patterns while remaining vulnerable to manipulation. Together, these findings emphasize the need for robust defenses, trust calibration, and transparent AI decision-making to ensure reliable human-AI collaboration in real-world, safety-critical contexts.
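To make the idea of a Beta-trust model concrete, here is a minimal Python sketch. It is not the paper's cognitive model: it assumes trust in each teammate is a Beta(alpha, beta) belief with a uniform prior, updated by one pseudo-count per correct or incorrect answer, and that influence is the normalized mean trust. The names `BetaTrust` and `allocate_influence` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BetaTrust:
    """Trust in one teammate as a Beta(alpha, beta) belief (illustrative assumption)."""
    alpha: float = 1.0  # pseudo-count of correct answers; uniform prior assumed
    beta: float = 1.0   # pseudo-count of incorrect answers

    def update(self, correct: bool) -> None:
        """Add one pseudo-count for the observed outcome."""
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Expected probability that the teammate answers correctly."""
        return self.alpha / (self.alpha + self.beta)


def allocate_influence(trusts):
    """Normalize mean trust values into influence weights that sum to 1."""
    means = [t.mean for t in trusts]
    total = sum(means)
    return [m / total for m in means]


if __name__ == "__main__":
    teammate = BetaTrust()
    teammate.update(True)
    teammate.update(False)
    print(teammate.mean)
    print(allocate_influence([BetaTrust(3.0, 1.0), BetaTrust(1.0, 3.0)]))
```

Under these assumptions, an attacker that answers correctly in early rounds accumulates trust that it can later spend on misleading answers, which is the kind of dynamic the MBRL attacker is described as exploiting.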

Abstract

As artificial intelligence (AI) assistants become more widely adopted in safety-critical domains, it becomes important to develop safeguards against potential failures or adversarial attacks. A key prerequisite to developing these safeguards is understanding the ability of these AI assistants to mislead human teammates. We investigate this attack problem within the context of an intellective strategy game where a team of three humans and one AI assistant collaborate to answer a series of trivia questions. Unbeknownst to the humans, the AI assistant is adversarial. Leveraging techniques from Model-Based Reinforcement Learning (MBRL), the AI assistant learns a model of the humans' trust evolution and uses that model to manipulate the group decision-making process to harm the team. We evaluate two models -- one inspired by literature and the other data-driven -- and find that both can effectively harm the human team. Moreover, we find that in this setting our data-driven model is capable of accurately predicting how human agents appraise their teammates given limited information on prior interactions. Finally, we compare the performance of state-of-the-art LLM models to human agents on our influence allocation task to evaluate whether the LLMs allocate influence similarly to humans or if they are more robust to our attack. These results enhance our understanding of decision-making dynamics in small human-AI teams and lay the foundation for defense strategies.

Paper Structure

This paper contains 22 sections, 5 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Overview of experimental protocol. (Phase 1) Participants select a difficulty level for the round's trivia question, (Phase 2) participants each individually answer the question and report a confidence, (Phase 3) participants discuss their individual round answers and allocate points according to influence, and (Phase 4) participants review correctness of their answers and their points earned.
  • Figure 2: (Left panel) Mean cumulative score observed in (1) our experimental groups compared with predictions from: (2) our cognitive model (\ref{subsec:cog_model}), (3) our ML model (\ref{subsec:ml_model}), and (4) a heuristic equal-weights model whereby everyone is assigned equal influence. We perform $k$-fold cross-validation, withholding one team at a time, and find the ML model best captures trends in influence evolution, outperforming the other models. (Right panel) Mean Squared Error (MSE) between the observed influence matrices and the influence matrices predicted by our three models. The ML model achieves the lowest MSE, indicating that it best predicts influence evolution. Notably, while the cognitive model slightly outperforms the equal-weights model in predicting the cumulative score, it has a higher MSE.
  • Figure 3: We compare predicted and observed data to show how our model predicts trends in the team's transactive memory system (TMS). We plot the average points allocated to the options chosen by the AI assistant, best player, and worst player under: (1) an ML model attacker, and (2) a cognitive model attacker. (Top row) Under the cognitive model attacker, we observe a weak negative trend in the points assigned to the options selected by the best player and AI assistant, and a weak positive trend in the points assigned to the worst player. (Bottom row) Under the ML model attacker, we observe a weak positive correlation and a weak negative correlation for the average points assigned to the best player and worst player, respectively. Relative to the cognitive model attacker, we observe a stronger negative trend in points assigned to the AI assistant's option, with the slope of the line of best fit reaching significance at $p < 0.001$. Our results reveal that while teams quickly learn to distrust the AI assistant, they do not learn to trust their best player or to distrust their worst player. Furthermore, we find our model predicts trends in the team's influence allocation, albeit with a slight bias. This suggests that our model captures key aspects of human-AI team decision-making dynamics.
  • Figure 4: (Left Panel) Projected score (based on average performance with no attack) compared to observed score in the last 15 rounds. Both attacks achieve a lower cumulative score than the projected line, indicating that they successfully harmed the performance of the team, with the ML model having the greater effect. (Right Panel) Average round score under each attack paradigm. Both attacks result in a lower average per-round score than the no-attack case. Furthermore, the ML model attack shows statistical significance with $p < 0.01$ vs. no attack and $p < 0.05$ vs. the cognitive model attack. Note that the data for the "Cognitive Model No Attack" and "ML Model No Attack" bars were collected under equivalent conditions but for different teams.
  • Figure 5: We compare human team performance to that of various LLMs on our task. (Left Panel) The performance of GPT-4o-mini is evaluated under two information conditions: access to the full performance history versus only the past three rounds, and with or without access to participant chat logs. The results suggest that participant chat logs contain critical information while older context is less relevant to LLM performance. (Right Panel) A comparison of the performance of various LLMs (with full performance history and chat logs) to human teams. The recent DeepSeek-R1 model outperforms all other LLMs and humans on the influence allocation task. Additionally, both LLMs and human teams were significantly affected by adversarial attacks, with Chain of Thought (CoT) models (o3-mini and DeepSeek-R1) showing the greatest vulnerability. Note: GPT models were hosted by OpenAI, DeepSeek-V3 [DS:24] was hosted by Meta, and DeepSeek-R1 [DS:25] was hosted by TogetherAI.
  • ...and 4 more figures
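
The MSE comparison described in the Figure 2 caption can be illustrated with a short Python sketch. It assumes influence matrices are square, row-stochastic lists of lists (row i holds the influence agent i assigns to each teammate) and that the equal-weights heuristic assigns 1/n to every entry. The function names and the 2x2 example matrix are hypothetical and not taken from the paper's data.

```python
def equal_weights(n):
    """Equal-weights heuristic: every agent assigns influence 1/n to everyone."""
    return [[1.0 / n] * n for _ in range(n)]


def influence_mse(observed, predicted):
    """Mean squared error over all entries of two same-shaped influence matrices."""
    squared_errors = [
        (o - p) ** 2
        for row_obs, row_pred in zip(observed, predicted)
        for o, p in zip(row_obs, row_pred)
    ]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    # Hypothetical observed matrix for two agents; each row sums to 1.
    observed = [[0.5, 0.5],
                [1.0, 0.0]]
    baseline = equal_weights(2)
    print(influence_mse(observed, baseline))
```

A predictor is only informative on this metric if its MSE falls below that of the equal-weights baseline, which is why the caption notes that the cognitive model has a higher MSE despite a slightly better cumulative-score prediction.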