Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Yubo Li, Ramayya Krishnan, Rema Padman

TL;DR

This work reframes LLM robustness in multi-turn dialogues as a time-to-inconsistency problem and analyzes 36,951 turns from 9 LLMs on MT-Consistency using survival models (Cox PH, Accelerated Failure Time, Random Survival Forest). It introduces semantic drift covariates (prompt-to-prompt, context-to-prompt, and cumulative drift) and finds that abrupt drift sharply increases hazard while cumulative drift can be protective, suggesting that conversations which survive repeated drift have adapted to it. Parametric AFT models with drift interactions deliver better discrimination and calibration than Cox PH and RSF, and proportional hazards (PH) violations reveal the limitations of Cox-based reasoning in this setting. A practical outcome is a lightweight Weibull AFT risk monitor that flags most failing conversations several turns before the first inconsistency, enabling real-time safeguards and risk-aware dialogue management. Overall, survival analysis provides a temporally resolved, actionable framework for evaluating and safeguarding multi-turn LLM interactions.
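
To make the three drift covariates concrete, here is a minimal sketch of how they could be computed from per-turn prompt embeddings. The summary does not give the exact definitions, so everything below is an assumption made for illustration: prompt-to-prompt drift is taken as the cosine distance between consecutive prompts, context-to-prompt drift as the cosine distance between the current prompt and the mean of all earlier prompts, and cumulative drift as the running sum of prompt-to-prompt drift. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def drift_features(prompt_embeddings):
    """Compute assumed drift covariates for each turn of one conversation.

    ASSUMPTION: these definitions are illustrative, not the paper's.
    """
    rows = []
    cumulative = 0.0
    for t, emb in enumerate(prompt_embeddings):
        if t == 0:
            # The first turn has no predecessor, so drift is set to zero.
            p2p = 0.0
            c2p = 0.0
        else:
            p2p = cosine_distance(prompt_embeddings[t - 1], emb)
            c2p = cosine_distance(mean_vector(prompt_embeddings[:t]), emb)
        cumulative += p2p
        rows.append(
            {"turn": t + 1, "p2p_drift": p2p, "c2p_drift": c2p, "cum_drift": cumulative}
        )
    return rows


# Toy embeddings: two identical prompts followed by an orthogonal one.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = drift_features(toy_embeddings)
```

In this toy conversation the third prompt is orthogonal to the first two, so it is the only turn with non-zero prompt-to-prompt drift. In practice the vectors would come from a sentence-embedding model, and each row would become one observation in the survival dataset.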

Abstract

Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is counterintuitively protective, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.
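
The turn-level risk monitor described above can be sketched as follows, assuming a Weibull AFT model whose scale depends log-linearly on the drift covariates. All coefficients, the shape parameter, the look-ahead horizon, and the alert threshold are hypothetical placeholders, not fitted values from the paper; only their signs follow the reported direction of effects. The sketch also holds the covariates fixed at their current values when looking ahead, which is a simplification.

```python
import math

# HYPOTHETICAL parameters, chosen only to illustrate the mechanics.
LOG_SCALE_INTERCEPT = 2.0  # baseline log time-to-inconsistency (in turns)
COEF_P2P = -1.5            # abrupt drift shortens expected survival time
COEF_CUM = 0.4             # cumulative drift lengthens it (protective)
SHAPE = 1.3                # Weibull shape parameter k


def weibull_survival(turn, p2p_drift, cum_drift):
    """S(t | x) = exp(-(t / scale(x))^k) with a log-linear scale."""
    scale = math.exp(
        LOG_SCALE_INTERCEPT + COEF_P2P * p2p_drift + COEF_CUM * cum_drift
    )
    return math.exp(-((turn / scale) ** SHAPE))


def conditional_failure_risk(turn, horizon, p2p_drift, cum_drift):
    """P(first inconsistency within `horizon` turns | consistent up to `turn`)."""
    s_now = weibull_survival(turn, p2p_drift, cum_drift)
    s_later = weibull_survival(turn + horizon, p2p_drift, cum_drift)
    return 1.0 - s_later / s_now


def should_alert(turn, p2p_drift, cum_drift, horizon=2, threshold=0.5):
    """Flag the conversation when the near-term failure risk is high."""
    return conditional_failure_risk(turn, horizon, p2p_drift, cum_drift) >= threshold


# Same turn and cumulative drift, but a calm versus an abrupt topic shift.
calm_risk = conditional_failure_risk(3, 2, p2p_drift=0.1, cum_drift=0.5)
abrupt_risk = conditional_failure_risk(3, 2, p2p_drift=0.9, cum_drift=0.5)
calm_alert = should_alert(3, p2p_drift=0.1, cum_drift=0.5)
abrupt_alert = should_alert(3, p2p_drift=0.9, cum_drift=0.5)
```

With these placeholder values the abrupt shift crosses the threshold while the calm turn does not. Raising the threshold trades earlier warnings for fewer false alerts, which is the balance the abstract refers to.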

Paper Structure

This paper contains 47 sections, 13 equations, 1 figure, 8 tables.

Figures (1)

  • Figure 1: Robustness Check: Cox Hazard Ratios vs. AFT Acceleration Factors. The models show strong directional agreement. P2P drift (Red) consistently increases risk ($\mathrm{HR}>1, \mathrm{AF}<1$), while Cumulative drift (Green) is consistently protective ($\mathrm{HR}<1, \mathrm{AF}>1$).
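
The opposite directions of HR and AF in the caption follow from a standard identity when the baseline is Weibull. This is a sketch under that assumption (a Cox model makes no such assumption, so agreement there is only expected directionally). The symbols $\mu$, $\beta$, and $k$ are notation introduced here, not taken from the paper. For a covariate $x$ with AFT coefficient $\beta$ and Weibull shape $k > 0$:

$$
S(t \mid x) = \exp\!\left[-\left(\frac{t}{e^{\mu + \beta x}}\right)^{k}\right],
\qquad
h(t \mid x) = k\, t^{k-1}\, e^{-k(\mu + \beta x)}
$$

$$
\mathrm{AF} = e^{\beta},
\qquad
\mathrm{HR} = e^{-k\beta} = \mathrm{AF}^{-k}
$$

Because $k > 0$, $\mathrm{AF} < 1$ (shorter time to inconsistency) implies $\mathrm{HR} > 1$, and $\mathrm{AF} > 1$ implies $\mathrm{HR} < 1$, matching the pattern reported for P2P and cumulative drift.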