The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan

TL;DR

The Alignment Waltz addresses the safety dilemma of LLMs by proposing WaltzRL, a two-agent reinforcement learning framework that jointly trains a conversation agent and a feedback agent in a positive-sum collaboration. A Dynamic Improvement Reward (DIR) drives adaptive feedback that improves subsequent responses, enabling unsafe outputs and overrefusal to be corrected rather than discarded. The method employs a two-stage training protocol to ensure reliable feedback and reduce latency, while deploying both agents at inference to sustain safety without sacrificing general usefulness. Across five diverse datasets, WaltzRL substantially lowers unsafe content and overrefusal rates while preserving instruction-following and general capabilities, thereby advancing the Pareto frontier of helpfulness and harmlessness.

Abstract

Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely: it may exacerbate overrefusals, and it fails to provide nuanced guidance for the queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
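The inference-time behaviour described in the abstract (the feedback agent engages only when a response looks unsafe or overrefusing, and the response is then revised rather than discarded) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Feedback` container, the `collaborate` function, the `max_rounds` parameter, and the two toy agents are all hypothetical names introduced here, and real agents would be LLM calls rather than string rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """Hypothetical container for the feedback agent's output."""
    unsafe: bool       # label: the response contains unsafe content
    overrefuse: bool   # label: the response refuses a benign prompt
    text: str          # textual suggestion for the conversation agent


def collaborate(
    prompt: str,
    converse: Callable[[str, Optional[str]], str],
    critique: Callable[[str, str], Feedback],
    max_rounds: int = 1,  # assumption: the round budget is a free parameter
) -> Tuple[str, int]:
    """Return the final response and the number of feedback rounds used."""
    response = converse(prompt, None)  # initial response, no feedback yet
    rounds = 0
    for _ in range(max_rounds):
        fb = critique(prompt, response)
        if not (fb.unsafe or fb.overrefuse):
            break  # response is acceptable: no revision, no extra latency
        response = converse(prompt, fb.text)  # revise instead of discarding
        rounds += 1
    return response, rounds


# Toy stand-ins for the two agents, for illustration only.
def toy_converse(prompt: str, feedback: Optional[str]) -> str:
    if "kill" in prompt and feedback is None:
        return "I can't help with that."  # overrefusal on a benign prompt
    if "kill" in prompt:
        return "Use `kill <pid>` to stop the process."
    return "4"


def toy_critique(prompt: str, response: str) -> Feedback:
    over = response.startswith("I can't")
    text = "The request is benign; answer it." if over else ""
    return Feedback(unsafe=False, overrefuse=over, text=text)


if __name__ == "__main__":
    print(collaborate("How do I kill a Python process?", toy_converse, toy_critique))
    print(collaborate("What is 2+2?", toy_converse, toy_critique))
```

In this sketch the benign-but-sensitive prompt triggers one feedback round, while the plain arithmetic prompt passes through untouched, which is the adaptive engagement the abstract credits with keeping latency low on safe queries.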

Paper Structure

This paper contains 44 sections, 5 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of WaltzRL. Left: Given a user prompt, the conversation agent produces an initial response. The feedback agent then reasons about its safety and overrefusal and produces labels and textual feedback. If the labels mark the initial response as unsafe or overrefusing, the feedback is passed to the conversation agent, which produces a revised response. Here, the feedback agent converts an unsafe response to an adversarial prompt into a safe, balanced response (detailed in the appendix on qualitative examples). Right: A single training step of WaltzRL. After the collaborative rollout, we gather training samples, compute the reward separately for each agent, and train both agents in parallel.
  • Figure 2: Left: Rate of conversation agent responses that improve under feedback in three setups (see the analysis section). Middle: Rate of conversation agent responses that worsen under feedback. Right: Accuracy of the feedback agent's predicted $(\texttt{unsafe}, \texttt{overrefuse})$ labels.
  • Figure 3: Stage 1 training dynamics. Left: Change of label correctness rate during stage 1 training. Right: Change of JSON parsing error rate during stage 1 training. The feedback agent learns the correct label and format in the first stage.
  • Figure 4: Stage 2 training dynamics. Left: Reward of the initial conversation agent response $c_0$. Right: Outcome reward of the final conversation agent response. WaltzRL successfully enhances the reward of both the initial response and the final outcome.
  • Figure 5: System prompt of the conversation agent.
  • ...and 1 more figure
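The captions above refer to a reward for the initial response $c_0$, an outcome reward for the final response, and separately computed rewards for each agent. One plausible reading of the Dynamic Improvement Reward, sketched below, is that the feedback agent is credited with the change in the conversation agent's reward after its feedback is incorporated. The exact formula is not given in this excerpt, so the difference form, the function names, and the binary toy reward are all assumptions made for illustration.

```python
def toy_outcome_reward(unsafe: bool, overrefuse: bool) -> float:
    """Hypothetical conversation-agent reward: 1 only if the response is
    neither unsafe nor an overrefusal. The paper's reward may differ."""
    return 1.0 if not (unsafe or overrefuse) else 0.0


def dynamic_improvement_reward(initial_reward: float, revised_reward: float) -> float:
    """Assumed form of DIR: how much the conversation agent's reward
    changed after incorporating the feedback (positive = improvement)."""
    return revised_reward - initial_reward


if __name__ == "__main__":
    # Feedback fixed an unsafe initial response.
    r0 = toy_outcome_reward(unsafe=True, overrefuse=False)
    r1 = toy_outcome_reward(unsafe=False, overrefuse=False)
    print(dynamic_improvement_reward(r0, r1))

    # Feedback pushed a good response into an overrefusal.
    r0 = toy_outcome_reward(unsafe=False, overrefuse=False)
    r1 = toy_outcome_reward(unsafe=False, overrefuse=True)
    print(dynamic_improvement_reward(r0, r1))
```

Under this assumed form, feedback that helps earns a positive reward, feedback that hurts earns a negative one, and feedback that changes nothing earns zero. Because the signal depends on how the current conversation agent responds to feedback, it shifts as that agent is trained, which matches the abstract's description of a reward that evolves over time.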