Learning to Lie: Reinforcement Learning Attacks Damage Human-AI Teams and Teams of LLMs
Abed Kareem Musaffar, Anand Gokhale, Sirui Zeng, Rasta Tadayon, Xifeng Yan, Ambuj Singh, Francesco Bullo
TL;DR
This work investigates how an adversarial AI can undermine small human-AI teams in safety-critical decision-making by learning trust dynamics and manipulating influence via Model-Based Reinforcement Learning (MBRL). It introduces a cognitive Beta-trust model and a data-driven ML predictor of how influence evolves, and it evaluates an MBRL attacker against humans in a 25-round trivia game. The study also benchmarks LLMs as decision-makers. It finds that data-driven attacks can significantly degrade team performance, and that LLMs can emulate human influence patterns while remaining vulnerable to manipulation. Together, these findings emphasize the need for robust defenses, trust calibration, and transparent AI decision-making to ensure reliable human-AI collaboration in real-world, safety-critical contexts.
Abstract
As artificial intelligence (AI) assistants become more widely adopted in safety-critical domains, it becomes important to develop safeguards against potential failures or adversarial attacks. A key prerequisite to developing these safeguards is understanding the ability of these AI assistants to mislead human teammates. We investigate this attack problem within the context of an intellective strategy game where a team of three humans and one AI assistant collaborate to answer a series of trivia questions. Unbeknownst to the humans, the AI assistant is adversarial. Leveraging techniques from Model-Based Reinforcement Learning (MBRL), the AI assistant learns a model of the humans' trust evolution and uses that model to manipulate the group decision-making process to harm the team. We evaluate two models, one inspired by the literature and the other data-driven, and find that both can effectively harm the human team. Moreover, we find that in this setting our data-driven model can accurately predict how human agents appraise their teammates given limited information on prior interactions. Finally, we compare state-of-the-art large language models (LLMs) to human agents on our influence allocation task to evaluate whether the LLMs allocate influence similarly to humans or are more robust to our attack. These results enhance our understanding of decision-making dynamics in small human-AI teams and lay the foundation for defense strategies.
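
The abstract does not spell out the form of the trust model. Purely as an illustration of what a Beta-style trust tracker might look like, the sketch below keeps two pseudo-counts and reports trust as the mean of the corresponding Beta distribution. The class name `BetaTrust`, the uniform prior, and the unit increments are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class BetaTrust:
    """Hypothetical Beta-style trust tracker (illustrative, not the paper's model)."""

    alpha: float = 1.0  # pseudo-count of correct advice (assumed uniform prior)
    beta: float = 1.0   # pseudo-count of incorrect advice (assumed uniform prior)

    def update(self, advice_was_correct: bool) -> None:
        # Assumed rule: add one pseudo-count per observed outcome.
        if advice_was_correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def trust(self) -> float:
        # Mean of Beta(alpha, beta).
        return self.alpha / (self.alpha + self.beta)


def simulate(advice_outcomes):
    """Return the trust value before the first round and after each round."""
    model = BetaTrust()
    history = [model.trust]
    for was_correct in advice_outcomes:
        model.update(was_correct)
        history.append(model.trust)
    return history


if __name__ == "__main__":
    # Two correct answers build trust; one wrong answer lowers it again.
    print(simulate([True, True, False]))
```

Under these assumptions, an attacker with access to such a model could estimate how much trust a deceptive answer would cost before choosing whether to give one, which is the kind of planning the abstract attributes to the MBRL agent.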
