Frictional Agent Alignment Framework: Slow Down and Don't Break Things

Abhijnan Nath, Carine Graff, Andrei Bachinin, Nikhil Krishnaswamy

TL;DR

FAAF addresses belief misalignment in dynamic, collaborative dialogue with a two-policy framework that jointly models frictive states and the friction interventions conditioned on them. The authors derive an analytical solution to the two-player objective, which allows a single policy to be trained with a simple supervised loss, and validate FAAF on DeliData and Weights Task Dataset variants, reporting robust OOD generalization and higher friction quality than RLHF-based baselines. GPT-4o data augmentation and human validation underpin the empirical gains, and ablations indicate that conditioning on the frictive state is necessary for strong performance. Overall, FAAF advances the idea of LLMs as adaptive thought partners that strategically slow dialogue down to improve reasoning and accountability in human-AI collaboration.

Abstract

AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF), to generate precise, context-aware "friction" that prompts for deliberation and re-examination of existing evidence. FAAF's two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive "thought partners" -- not passive responders -- FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.

Paper Structure

This paper contains 42 sections, 6 theorems, 59 equations, 9 figures, 12 tables, 1 algorithm.

Key Result

Lemma 1

When substituting the optimal friction intervention policy $\pi_f^*$, as derived in Eq. \ref{eq:optimal_frictive_agent_policy}, into Eq. \ref{eq:two_stage_main_objective}, the objective in Eq. \ref{eq:two_stage_main_objective} reduces to:

Figures (9)

  • Figure 1: FAAF conditions responses on both the dialogue context $x$ and representation of the "frictive" (belief) state $\phi$, to generate outputs that prompt for reflection, deliberation, and verification of evidence.
  • Figure 2: DeliData \cite{karadzhov2023delidata} Friction Generation Prompt. We use GPT-4o as our sampling distribution $\mu$ and prompt it to simultaneously generate frictive states and friction interventions. For diversity, we use the default temperature of 1. This process implicitly provides us with preference rankings between interventions, via the reward scores. See Sec. \ref{sec:defs} for definitions of frictive states and friction interventions. Note that we exclude “probing” interventions from this generation process, since they are already present in the original DeliData annotations.
  • Figure 3: Weights Task dataset \cite{khebour-etal-2024-common} Friction Generation Prompt. We use GPT-4o as our sampling distribution $\mu$ and prompt it to simultaneously generate frictive states and friction interventions. For diversity, we use the default temperature of 1.
  • Figure 4: “Simulated” Weights Task dataset (WTD Simulated) Friction Generation Prompt. To ground these friction interventions in the personality traits of the participants, we use the prompting framework of \cite{mao2024editing} with personality-facet combinations. We use GPT-4o as our sampling distribution $\mu$ and prompt it to simultaneously generate frictive states and friction interventions. For diversity, we use the default temperature of 1.
  • Figure 5: Plots showing distributional differences between WTD original and simulated data. Dialogue contexts (top) show clear separation with both within-group similarities exceeding across-group similarity ($\bar{x}=0.581$). In contrast, friction interventions (bottom) exhibit weaker separation with across-group similarity ($\bar{x}=0.400$) falling between within-group values.
  • ...and 4 more figures
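
The within-group versus across-group comparison described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes cosine similarity over embedding vectors; the caption names neither the embedding model nor the similarity metric, so both are assumptions here. The toy 2-D vectors and the helper names (`cosine`, `mean_within`, `mean_across`) are hypothetical and are not taken from the paper or its repository.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_within(group):
    """Mean similarity over all unordered pairs inside one group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_across(group_a, group_b):
    """Mean similarity over all pairs with one item from each group."""
    pairs = list(product(group_a, group_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for embeddings of the two data sources (hypothetical values).
original = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
simulated = [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]

within_orig = mean_within(original)
within_sim = mean_within(simulated)
across = mean_across(original, simulated)

# "Clear separation" in the caption's sense: both within-group means
# exceed the across-group mean.
separated = across < min(within_orig, within_sim)
print(within_orig, within_sim, across, separated)
```

Under this reading, the weaker separation reported for friction interventions corresponds to the across-group mean falling between the two within-group means, so that `separated` would be false.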

Theorems & Definitions (13)

  • proof
  • Lemma 1: Value of Inner Maximization
  • proof
  • Theorem 2: Uniqueness of FAAF Empirical Loss
  • proof
  • Lemma 3: Sequential Choice Decomposition in Friction Agent Optimization
  • proof
  • Lemma 4: Uniqueness of Intervention Thresholds
  • proof
  • Corollary 5: Uniqueness of Optimal Policy Under Threshold Identity
  • ...and 3 more