Teaching Models to Balance Resisting and Accepting Persuasion

Elias Stengel-Eskin, Peter Hase, Mohit Bansal

TL;DR

The work tackles the dual challenge of teaching LLMs to resist harmful persuasion while remaining open to beneficial corrections. It introduces Persuasion-Balanced Training (PBT), which uses a multi-agent recursive tree data pipeline and a preference-based optimization objective to train models to both resist negative persuasion and accept positive persuasion. Across misinformation, flipflop, and multi-agent debate scenarios, PBT yields stronger, more stable performance than training for resistance or acceptance alone and transfers improvements to reasoning tasks like StrategyQA. The results suggest that balanced persuasion training can make LLMs more reliable teammates in collaborative and adversarial settings, with practical implications for safety and collaborative AI systems.

Abstract

Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. Moreover, PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates across two domains (trivia and commonsense QA). We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
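
The abstract does not spell out how dialogue trees become preference data, so the following is only a minimal sketch of one plausible reading: sibling responses that continue the same dialogue history but receive different scores form a (chosen, rejected) pair. The `Node` class, its `score` field, the `preference_pairs` helper, and the toy dialogue are all hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """One turn in a dialogue tree (hypothetical structure, not the paper's)."""
    response: str
    score: float = 0.0  # assumed: 1.0 if the turn's answer is correct, else 0.0
    children: list["Node"] = field(default_factory=list)


def preference_pairs(node, history=()):
    """Collect (history, chosen, rejected) triples from differently scored siblings."""
    context = history + (node.response,)
    pairs = []
    # Siblings share the same history, so they act as counterfactual responses.
    for a, b in combinations(node.children, 2):
        if a.score != b.score:
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append((context, chosen.response, rejected.response))
    # Recurse so that every depth of the tree can contribute pairs.
    for child in node.children:
        pairs.extend(preference_pairs(child, context))
    return pairs


# Toy tree: one branch starts wrong and is then either corrected or not.
root = Node("Q: What is the capital of Australia?", children=[
    Node("Sydney", 0.0, children=[
        Node("I will stay with Sydney.", 0.0),
        Node("You are right, it is Canberra.", 1.0),
    ]),
    Node("Canberra", 1.0),
])

pairs = preference_pairs(root)
for context, chosen, rejected in pairs:
    print(context, "| chosen:", chosen, "| rejected:", rejected)
```

Triples in this shape could then be passed to a preference-optimization trainer. How the pairs are balanced between resisting and accepting persuasion is not specified in the abstract, so that step is left out of the sketch.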

Paper Structure

This paper contains 25 sections, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Resisting negative persuasion and accepting positive persuasion are both needed for productive dialogues. However, only optimizing one or the other can lead to overcorrection. We argue that the two must be balanced, i.e. the agent should resist and accept persuasion when appropriate.
  • Figure 2: Overview of our multi-agent recursive tree-based method. Preference pairs are obtained by rolling out dialogues between agents with different roles, producing counterfactual responses with different scores. We balance these pairs and use them to train models with PBT.
  • Figure 3: Accuracy of a team after discussion. A strong model (Llama 3.1 70B) paired with a weaker model (Llama 3.1 8B) leads to order dependence. Accept-only and resist-only training fail to address this variance and hurt team performance, but PBT leads to strong performance regardless of which model goes first.
  • Figure 4: Baseline and team performance for Base-Base, Base-Accept, and Base-PBT teams. Base-Base and Base-Accept have larger drops depending on which teammate goes first. PBT has more consistent team performance, with the rightmost green bars being most similar to the 70B solo performance.
  • Figure 5: Qualitative examples from each model. Accept-only and resist-only models work in one direction (positive or negative persuasion) but not the other. PBT works for both types of persuasion.
  • ...and 4 more figures