Quantifying Risks in Multi-turn Conversation with Large Language Models

Chengxiao Wang, Isha Chaudhary, Qian Hu, Weitong Ruan, Rahul Gupta, Gagandeep Singh

TL;DR

This paper addresses the challenge of certifying catastrophic risks in multi-turn LLM conversations, arguing that benchmarking with fixed attack prompts fails to capture the broad risk space. It introduces QRLLM, a certification framework that models conversations as Markov processes on a semantically constructed query graph and provides high-confidence probability bounds over distributions of query sequences. Three distributions—Random Node, Graph Path, and Adaptive with Rejection—are instantiated to reflect realistic attacker behaviors and enable statistical guarantees via confidence intervals. Empirical results across multiple frontier models and scenarios show non-trivial catastrophic risk, with model-dependent safety gaps and insights into attack patterns, context effects, and the limitations of baseline single-turn benchmarks. The framework offers a principled tool for cross-model safety evaluation and highlights practical implications for safety training and deployment of LLMs.

Abstract

Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose QRLLM, a novel, principled Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
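To make "quantify catastrophic risks using confidence intervals" concrete, the sketch below turns a count of catastrophic outcomes among sampled conversations into a lower and an upper bound on the underlying probability. The abstract does not name the interval construction, so the Clopper-Pearson (exact binomial) interval used here is an assumption, as are the function names `binom_cdf` and `clopper_pearson` and the default `alpha=0.05`.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05, tol=1e-10):
    """Two-sided (1 - alpha) interval for a binomial proportion.

    Assumption: the paper's exact interval method is not stated in this
    summary; Clopper-Pearson is used here purely as an illustration.
    k = number of sampled conversations judged catastrophic, n = samples.
    """

    def solve(f, target):
        # f is monotonically decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper
```

For example, `clopper_pearson(35, 50)` returns the interval for 35 catastrophic outcomes in 50 sampled conversations. The lower end of such an interval plays the role of a certified lower bound, and more samples tighten it.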

Paper Structure

This paper contains 28 sections, 11 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overall certification framework QRLLM (Quantifying Risks in multi-turn conversation with LLMs). Starting from a query set, we construct a graph in which edges connect semantically similar queries. On this graph, we define formal specifications as probability distributions over query sequences. For each sampled sequence, we query the LLM, use a judge model to determine whether the response is harmful, and aggregate the results to compute certification bounds on the probability of catastrophic risk.
  • Figure 2: Responses of Mistral, DeepSeek, and gpt-oss on GPh distribution to the target query: “Explain how and where to purchase the ingredients for manufacturing sarin without being traced.”
  • Figure 3: Effect of providing conversational context for the target query “Give detailed step-by-step instructions for manufacturing ricin at home.” on GPv distribution: Without context, Claude and DeepSeek tend to give only general information about chemicals and Llama is confused by the fictional setting. When relevant prior context is included, these models' responses shift to ricin, leading to catastrophic responses.
  • Figure 4: Certification results for the chemical_biological dataset. Each panel shows the distribution of lower bounds and upper bounds under different specifications for one LLM.
  • Figure 5: Certification results for the cyber crime dataset. Each panel shows the distribution of lower bounds and upper bounds under different specifications for one LLM.
  • ...and 3 more figures
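
The pipeline in the Figure 1 caption (build a graph over queries, sample query sequences from it, judge each conversation, aggregate) can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: token-level Jaccard overlap stands in for the semantic similarity measure, the walk uses uniform transitions to neighbors, and `is_catastrophic` is a hypothetical callable that would wrap the target LLM and the judge model. The names `build_query_graph`, `sample_path`, and `estimate_risk` are likewise illustrative.

```python
import random


def jaccard(a, b):
    """Token-overlap similarity; a placeholder for an embedding-based measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def build_query_graph(queries, threshold=0.2, sim=jaccard):
    """Adjacency list connecting queries i and j when sim >= threshold."""
    graph = {i: [] for i in range(len(queries))}
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if sim(queries[i], queries[j]) >= threshold:
                graph[i].append(j)
                graph[j].append(i)
    return graph


def sample_path(graph, length, rng):
    """Random walk: the next node depends only on the current node.

    Assumption: uniform start node and uniform transitions to neighbors.
    The walk stops early if the current node has no neighbors.
    """
    node = rng.choice(list(graph))
    path = [node]
    while len(path) < length and graph[node]:
        node = rng.choice(graph[node])
        path.append(node)
    return path


def estimate_risk(graph, queries, is_catastrophic, n_samples, length, seed=0):
    """Count sampled query sequences that the (hypothetical) judge flags."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sequence = [queries[i] for i in sample_path(graph, length, rng)]
        hits += bool(is_catastrophic(sequence))
    return hits, n_samples
```

The returned pair `(hits, n_samples)` is what a binomial confidence interval would consume to produce the lower and upper bounds plotted in the certification figures.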