Toxicity Ahead: Forecasting Conversational Derailment on GitHub
Mia Mohammad Imran, Robert Zita, Rahat Rizvi Rahman, Preetha Chatterjee, Kostadin Damevski
TL;DR
The paper addresses proactive moderation of toxicity in GitHub discussions by forecasting conversational derailment. It uses Least-to-Most (LtM) prompting to generate Summaries of Conversation Dynamics (SCDs), then predicts derailment from those summaries with large language models. It introduces a curated dataset of derailed toxic and non-toxic conversations and shows that signals such as sentiment shifts, tension triggers, and specific linguistic cues precede toxicity. LtM-based SCDs reach F1 scores of 0.901 on Qwen and 0.852 on Llama, outperforming established baselines, and external validation on the independent Raman dataset indicates that the approach generalizes. Ablation analysis identifies sentiment dynamics and tension triggers as key contributors. The work offers practical, explainable moderation tools and actionable recommendations for deploying early warning systems in OSS communities, and it outlines avenues for further benchmarking, efficiency improvements, and cross-platform validation.
Abstract
Toxic interactions in Open Source Software (OSS) communities reduce contributor engagement and threaten project sustainability. Preventing such toxicity before it emerges requires a clear understanding of how harmful conversations unfold. However, most proactive moderation strategies are manual, requiring significant time and effort from community maintainers. To support more scalable approaches, we curate a dataset of 159 derailed toxic threads and 207 non-toxic threads from GitHub discussions. Our analysis reveals that toxicity can be forecast by tension triggers, sentiment shifts, and specific conversational patterns. We present a novel Large Language Model (LLM)-based framework for predicting conversational derailment on GitHub using a two-step prompting pipeline. First, we generate Summaries of Conversation Dynamics (SCDs) via Least-to-Most (LtM) prompting; then we use these summaries to estimate the likelihood of derailment. Evaluated on Qwen and Llama models, our LtM strategy achieves F1-scores of 0.901 and 0.852, respectively, at a decision threshold of 0.3, outperforming established NLP baselines on conversation derailment. External validation on a dataset of 308 GitHub issue threads (65 toxic, 243 non-toxic) yields an F1-score of up to 0.797. Our findings demonstrate the effectiveness of structured LLM prompting for early detection of conversational derailment in OSS, enabling proactive and explainable moderation.
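The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sub-questions, prompt wording, function names, and the `fake_llm` stand-in are all hypothetical, and the choice to flag a thread when the score is greater than or equal to the threshold is an assumption. Only the 0.3 decision threshold comes from the abstract.

```python
from typing import Callable, List

# Decision threshold reported in the abstract.
DERAILMENT_THRESHOLD = 0.3

# Hypothetical sub-questions, ordered from simpler to harder (Least-to-Most).
# The paper's actual prompts are not reproduced here.
LTM_SUBQUESTIONS = [
    "Who are the participants and what does each one want?",
    "How does the sentiment shift across the comments?",
    "Which comments, if any, act as tension triggers?",
]


def summarize_dynamics(thread: List[str], llm: Callable[[str], str]) -> str:
    """Step 1: build an SCD, feeding earlier answers into later prompts."""
    conversation = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(thread))
    notes: List[str] = []
    for question in LTM_SUBQUESTIONS:
        prompt = (
            f"Conversation:\n{conversation}\n\n"
            "Earlier answers:\n" + "\n".join(notes) + "\n\n"
            f"Question: {question}"
        )
        notes.append(llm(prompt))
    return llm(
        f"Conversation:\n{conversation}\n\nNotes:\n" + "\n".join(notes)
        + "\n\nWrite a short summary of the conversation dynamics."
    )


def derailment_probability(scd: str, llm: Callable[[str], str]) -> float:
    """Step 2: ask for a likelihood in [0, 1] based only on the SCD."""
    reply = llm(
        f"Summary of conversation dynamics:\n{scd}\n\n"
        "How likely is this conversation to derail into toxicity, "
        "from 0 to 1? Answer with a number only."
    )
    score = float(reply.strip())  # raises ValueError on a non-numeric reply
    return min(1.0, max(0.0, score))  # clamp to the valid range


def will_derail(thread: List[str], llm: Callable[[str], str],
                threshold: float = DERAILMENT_THRESHOLD) -> bool:
    """Flag a thread when the estimated likelihood reaches the threshold."""
    scd = summarize_dynamics(thread, llm)
    return derailment_probability(scd, llm) >= threshold


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if "Answer with a number only" in prompt:
        return "0.45"
    return "Tone turns from neutral to frustrated after the second comment."


if __name__ == "__main__":
    example = ["Thanks for the PR.", "This was already rejected twice."]
    print(will_derail(example, fake_llm))
```

In practice, `fake_llm` would be replaced by a call to a Qwen or Llama model. Keeping the summary step separate from the scoring step means the SCD can be shown to maintainers as the explanation for a flag.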
