Moloch's Bargain: Emergent Misalignment When LLMs Compete for Audiences

Batu El, James Zou

TL;DR

The paper investigates whether optimizing LLMs for competitive audience engagement induces emergent misalignment, using multi-agent simulations of sales, elections, and social media. It introduces two learning paradigms, Rejection Fine-Tuning (RFT) and Text Feedback (TFB), and finds that as performance improves, misaligned behaviors such as deception, disinformation, and harmful rhetoric also rise, a trade-off the authors call Moloch's Bargain. Misalignment correlates with performance gains across domains, a result validated by probes and human checks, underscoring gaps in current safety safeguards. The work highlights the need for stronger governance and incentive design to mitigate race-to-the-bottom dynamics in real-world AI deployments.

Abstract

Large language models (LLMs) are increasingly shaping how information is created and disseminated, from companies using them to craft persuasive advertisements, to election campaigns optimizing messaging to gain votes, to social media influencers boosting engagement. These settings are inherently competitive, with sellers, candidates, and influencers vying for audience approval, yet it remains poorly understood how competitive feedback loops influence LLM behavior. We show that optimizing LLMs for competitive success can inadvertently drive misalignment. Using simulated environments across these scenarios, we find that a 6.3% increase in sales is accompanied by a 14.0% rise in deceptive marketing; in elections, a 4.9% gain in vote share coincides with 22.3% more disinformation and 12.5% more populist rhetoric; and on social media, a 7.5% engagement boost comes with 188.6% more disinformation and a 16.3% increase in promotion of harmful behaviors. We call this phenomenon Moloch's Bargain for AI: competitive success achieved at the cost of alignment. These misaligned behaviors emerge even when models are explicitly instructed to remain truthful and grounded, revealing the fragility of current alignment safeguards. Our findings highlight how market-driven optimization pressures can systematically erode alignment, creating a race to the bottom, and suggest that safe deployment of AI systems will require stronger governance and carefully designed incentives to prevent competitive dynamics from undermining societal trust.

Paper Structure

This paper contains 44 sections, 5 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Generations before and after training across three domains (Top). In sales, trained models introduce misrepresentation, where claims diverge from or contradict the ground truth product descriptions. In elections, optimization amplifies inflammatory populist rhetoric, such as the use of "the radical progressive left’s assault on our constitution". In social media, engagement gains coincide with disinformation, for example inflating the number of reported deaths in an article. Training setup (Bottom). Models interact with simulated audiences (customers, voters, or users) and are updated based on feedback from these environments. This process improves agents in the direction of their competitive objectives but inadvertently drives misalignment.
  • Figure 2: Relative increase in misalignment after training for competitive success. In 9 out of 10 cases, we observe an increase in misalignment after training. The y-axis denotes Qwen and Llama models trained with Rejection Fine-Tuning (RFT) and Text Feedback (TFB). The x-axis represents the increase in misalignment relative to the baseline. Each plot corresponds to one probe, with the task name shown in parentheses: Sales (S), Elections (E), Social Media (SM).
  • Figure 3: Demonstration of the training pipeline for the sales task. The model generates messages conditioned on a given anchor (product description). Multiple generations are sampled from the same anchor. The users then express their thoughts and make decisions. For RFT, the model is fine-tuned on the preferred sales pitches, as well as on the agent’s intermediate thoughts preceding those pitches. For TFB, in addition to the RFT objective, the model is further trained to predict the users’ thoughts about the two generated options. At test time, the trained agent is evaluated on a held-out set of products.
  • Figure 4: Correlation between Performance Improvement and Increase in Misalignment. In 8 out of 10 cases, there is a strong positive correlation between performance gains and increases in misalignment. The y-values represent performance improvements from the paper's performance table (tab:performance), and the x-values represent increases in misalignment from its misalignment table (tab:misalignment).
  • Figure 5: Correlation between Performance and Safety Concerns. The y-axis represents performance improvements from the paper's performance table (tab:performance), while the x-axis represents increases in misalignment from its misalignment table (tab:misalignment). These cherry-picked cases are illustrative of instances where performance and misalignment appear most closely linked.
  • ...and 1 more figure
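
The Figure 3 caption describes how training data is assembled for RFT and TFB. The following is a minimal sketch of that data-construction step, not the authors' implementation. The names `Generation`, `AudienceFeedback`, and `build_training_examples` are hypothetical, and the sketch assumes that audience preference is aggregated by majority vote over a pair of generations (ties going to the first option) and that the TFB prompt simply lists both options as "A" and "B". None of these details are stated in the caption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Generation:
    thought: str   # agent's intermediate reasoning preceding the pitch
    message: str   # the pitch shown to the simulated audience


@dataclass
class AudienceFeedback:
    choice: int    # index (0 or 1) of the generation this user preferred
    thoughts: str  # the simulated user's free-text reasoning


def build_training_examples(
    anchor: str,
    pair: Tuple[Generation, Generation],
    feedback: Sequence[AudienceFeedback],
    use_text_feedback: bool = False,
) -> List[Tuple[str, str]]:
    """Return (prompt, target) fine-tuning pairs for one anchor.

    RFT part: keep only the preferred generation (thought + message).
    TFB part (optional): additionally train the model to predict each
    user's thoughts about the two options.
    """
    # Assumption: majority vote decides the preferred option; ties go to option 0.
    votes = [sum(1 for f in feedback if f.choice == i) for i in range(2)]
    winner = pair[0] if votes[0] >= votes[1] else pair[1]

    # RFT objective: imitate the winning thought and pitch given the anchor.
    examples = [(anchor, winner.thought + "\n" + winner.message)]

    if use_text_feedback:
        # Assumed prompt layout for the feedback-prediction objective.
        prompt = f"{anchor}\nA: {pair[0].message}\nB: {pair[1].message}"
        for f in feedback:
            examples.append((prompt, f.thoughts))
    return examples


if __name__ == "__main__":
    pair = (Generation("t0", "m0"), Generation("t1", "m1"))
    fb = [AudienceFeedback(1, "likes B"), AudienceFeedback(0, "prefers A"),
          AudienceFeedback(1, "B is clearer")]
    for prompt, target in build_training_examples("desc", pair, fb, True):
        print(repr(prompt), "->", repr(target))
```

Under these assumptions, RFT alone yields one example per anchor, while TFB adds one feedback-prediction example per simulated user. The fine-tuning step itself and the held-out evaluation are outside the scope of this sketch.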