Collaborate, Deliberate, Evaluate: How LLM Alignment Affects Coordinated Multi-Agent Outcomes
Abhijnan Nath, Carine Graff, Nikhil Krishnaswamy
TL;DR
This work investigates how LLM alignment methods influence coordinated multi-agent outcomes by modeling interventions as friction in a Modified-Action MDP ($\mathcal{M}_f$) and evaluating them via roleplay on two tasks: the DeliData Wason Card Task and the Weights Task. The authors introduce FAAF, a friction-aware alignment approach, and show that it outperforms traditional baselines (SFT, DPO, IPO, PPO, BC) on common-ground convergence and task accuracy, especially under action-modification conditions, where collaborators can reinterpret or ignore interventions. Through data generation, human validation, and an extensive roleplay evaluation, they demonstrate that incorporating friction and explicitly modeling action modification yields more robust, deliberative collaboration and better task outcomes. The results highlight the importance of evaluating alignment in realistic, long-horizon, multi-agent settings and suggest that friction-enabled strategies can improve reliability and accountability in human-AI collaboration. These findings have practical implications for deploying AI collaborators in complex teamwork, with potential impact on collaborative reasoning, decision support, and AI governance.
Abstract
As Large Language Models (LLMs) are integrated into diverse workflows, they are increasingly regarded as "collaborators" with humans and are required to work in coordination with other AI systems. If such AI collaborators are to reliably coordinate their actions and behaviors with humans or other AIs, their properties and behaviors over multi-turn interactions must be known and predictable. This paper examines how different alignment methods affect LLM agents' effectiveness as partners in multi-turn, multi-party collaborations. We study this question through the lens of intervention agents that insert themselves into group dialogues not to provide answers, but to encourage the collaborative group to slow down and reflect upon their reasoning for deliberative decision-making. Common alignment techniques are typically developed under simplified single-user settings and assume the optimality of the underlying token MDP. Using the theoretical lens of the modified-action MDP, we show that these techniques do not account for the dynamics of long-horizon multi-party interactions. We present a novel roleplay simulation methodology, in which we align LLMs according to different methods and then deploy them in collaborative task dialogues to quantify how interventions affect the trajectory of group collaboration, belief alignment, and coordination. Our results show that an intervention agent that is robust to action modification significantly outperforms common alignment baselines in supporting correct task outcomes.
