Quantifying Risks in Multi-turn Conversation with Large Language Models
Chengxiao Wang, Isha Chaudhary, Qian Hu, Weitong Ruan, Rahul Gupta, Gagandeep Singh
TL;DR
This paper addresses the challenge of certifying catastrophic risks in multi-turn LLM conversations, arguing that benchmarking with fixed attack prompts fails to capture the broad risk space. It introduces QRLLM, a certification framework that models conversations as Markov processes on a semantically constructed query graph and provides high-confidence probability bounds over distributions of query sequences. Three distributions—Random Node, Graph Path, and Adaptive with Rejection—are instantiated to reflect realistic attacker behaviors and enable statistical guarantees via confidence intervals. Empirical results across multiple frontier models and scenarios show non-trivial catastrophic risk, with model-dependent safety gaps and insights into attack patterns, context effects, and the limitations of baseline single-turn benchmarks. The framework offers a principled tool for cross-model safety evaluation and highlights practical implications for safety training and deployment of LLMs.
Abstract
Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose QRLLM, a novel, principled certification framework for catastrophic risks in multi-turn conversations with LLMs. QRLLM bounds, with statistical guarantees, the probability that an LLM generates a catastrophic response under a distribution over multi-turn conversations. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
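The certification idea above can be illustrated with a minimal sketch: sample query sequences by a random walk on a small query graph, count how many sampled conversations a judge flags as catastrophic, and turn that count into a confidence interval on the underlying probability. Everything in the sketch is an assumption made for illustration and not taken from the paper: the toy graph, the uniform transition rule, the placeholder judge `toy_judge` (standing in for querying a target model and scoring its response), and the choice of the Clopper-Pearson interval, since the abstract does not state which interval construction is used.

```python
import math
import random

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion.

    Assumed interval type; each endpoint is found by bisection on p.
    """
    def bisect(go_right):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if go_right(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower endpoint: p with P(X >= k | p) = alpha / 2 (0 when k == 0).
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper endpoint: p with P(X <= k | p) = alpha / 2 (1 when k == n).
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper

def sample_path(graph, length, rng):
    """Random walk: uniform start node, then uniform moves along edges."""
    node = rng.choice(sorted(graph))
    path = [node]
    while len(path) < length and graph[node]:
        node = rng.choice(graph[node])
        path.append(node)
    return path

def certify(graph, judge, n_samples=200, length=4, alpha=0.05, seed=0):
    """Estimate bounds on P(catastrophic) under the random-walk distribution."""
    rng = random.Random(seed)
    k = sum(judge(sample_path(graph, length, rng)) for _ in range(n_samples))
    return (k, *clopper_pearson(k, n_samples, alpha))

# Hypothetical query graph: an edge links two semantically similar queries.
graph = {
    "q0": ["q1", "q2"],
    "q1": ["q0", "q2"],
    "q2": ["q0", "q1", "q3"],
    "q3": ["q2"],
}

def toy_judge(path):
    """Placeholder for 'target model answered catastrophically'."""
    return path[-1] == "q3"

k, lower, upper = certify(graph, toy_judge)
print(f"{k}/200 flagged; 95% interval: [{lower:.3f}, {upper:.3f}]")
```

In this reading, the lower endpoint plays the role of a certified lower bound: with confidence at least 1 - alpha, the true probability of a catastrophic response under the sampled distribution is no smaller than `lower`. A real instantiation would replace the toy graph with one built from query embeddings and the toy judge with an actual model-plus-judge pipeline.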
