Forecasting Conversation Derailments Through Generation
Yunfan Zhang, Kathleen McKeown, Smaranda Muresan
TL;DR
This work forecasts whether a conversation that has been benign so far will derail. A fine-tuned LLM generates multiple plausible future turns, a dedicated classifier predicts derailment for each generated continuation, and the predictions are aggregated by majority vote. The framework is flexible and can incorporate social orientation labels to guide generation. On CGA-Wiki and BNC, the approach shows significant accuracy gains over prior methods and GPT-4o few-shot baselines. Ablations examine generation depth and social cues, and the authors acknowledge the higher computational cost of the approach. The results have practical implications for proactive moderation and conflict mitigation, and the authors provide open-source resources for replication and extension.
Abstract
Forecasting conversation derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, although language models succeed at identifying offensive speech already present in a conversation, they struggle to forecast future derailments. In contrast to prior work that predicts conversation outcomes solely from the past conversation history, our approach uses a fine-tuned LLM to sample multiple future conversation trajectories conditioned on the existing history, and predicts the conversation outcome based on the consensus of these trajectories. We also experiment with using socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of generating future conversation trajectories surpasses state-of-the-art results on English conversation derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.
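The consensus step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The names `forecast_derailment`, `generate_future`, and `classify_derailed` are hypothetical, the generator and classifier are passed in as plain callables standing in for the fine-tuned LLM and the dedicated classifier, and two details are assumptions: the classifier is shown the history together with the sampled future, and a tied vote counts as "no derailment".

```python
from typing import Callable, List

Turn = str  # one utterance in the conversation


def forecast_derailment(
    history: List[Turn],
    generate_future: Callable[[List[Turn]], List[Turn]],  # hypothetical: samples one future trajectory
    classify_derailed: Callable[[List[Turn]], bool],      # hypothetical: True if the conversation derails
    num_samples: int = 5,
) -> bool:
    """Forecast derailment by majority vote over sampled future trajectories."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")

    derail_votes = 0
    for _ in range(num_samples):
        # Sample one plausible continuation conditioned on the existing history.
        future = generate_future(history)
        # Assumption: the classifier sees the history plus the sampled future.
        if classify_derailed(history + future):
            derail_votes += 1

    # Strict majority; with an even num_samples a tie is treated as "no derailment".
    return 2 * derail_votes > num_samples
```

Using an odd `num_samples` avoids ties altogether. Each additional sample costs one more generation and one more classification, which is where the higher computational cost of this style of forecasting comes from.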
