Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

J Alex Corll

TL;DR

The paper tackles the problem of detecting multi-turn LLM attacks at the proxy layer without invoking an LLM. It shows that naive weighted-average aggregation of per-turn scores fails to capture persistence, then introduces Peak + Accumulation scoring, which combines peak turn risk, persistence, and category diversity, with escalation and resampling bonuses. Evaluated on 10,654 conversations (588 attacks, 10,066 benign), the formula achieves 90.8% recall at a 1.20% false positive rate and an F1 of 85.9%. A phase transition near $\rho \approx 0.4$ in the persistence parameter guides the parameter choice (default $\rho = 0.45$). The work provides an implementable, model-free scoring rule, an open-source release, and a discussion of integration into layered defenses, enabling fast, auditable proxy-level multi-turn protection. The approach is most relevant to production defenses where LLM inference is undesirable because of latency and cost.

Abstract

Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer -- without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring -- a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations -- 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat -- the formula achieves 90.8% recall at 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at $\rho \approx 0.4$, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.
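The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes the reported rates are rounded from whole-conversation counts, so the true positive and false positive counts are inferred by rounding and are not figures quoted from the paper. The function names are illustrative only.

```python
def confusion_from_rates(n_attack, n_benign, recall, fpr):
    """Infer integer confusion counts from reported rates (assumes simple rounding)."""
    tp = round(n_attack * recall)  # attacks flagged
    fp = round(n_benign * fpr)     # benign conversations flagged
    fn = n_attack - tp             # attacks missed
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Dataset sizes and rates as stated in the abstract.
tp, fp, fn = confusion_from_rates(588, 10_066, recall=0.908, fpr=0.0120)
f1 = f1_score(tp, fp, fn)
print(tp, fp, fn, round(100 * f1, 1))  # 534 121 54 85.9
```

Under that rounding assumption, precision comes out near 81.5%, and the resulting F1 agrees with the reported 85.9%. The recall, false positive rate, and F1 figures are therefore mutually consistent.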

Paper Structure (44 sections, 1 theorem, 7 equations, 1 figure, 4 tables)

Key Result

Theorem 1

When all turns produce the same score $s$, the weighted average equals $s$ regardless of the number of turns $n$.
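The statement follows in one line: for any positive weights $w_1, \dots, w_n$, $\frac{\sum_i w_i s}{\sum_i w_i} = s \cdot \frac{\sum_i w_i}{\sum_i w_i} = s$. The sketch below checks this numerically. The exponential recency weighting and the `decay` value are assumptions chosen for illustration and are not necessarily the scheme the paper analyzes. The identity holds for any positive weights.

```python
def weighted_average(scores, weights):
    """Weighted mean of per-turn scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def recency_weights(n, decay=0.8):
    """Hypothetical weighting: the latest turn gets weight 1, older turns decay."""
    return [decay ** (n - 1 - i) for i in range(n)]


s = 0.6  # identical per-turn score
averages = {
    n: weighted_average([s] * n, recency_weights(n))
    for n in (1, 5, 20)
}

for n, avg in averages.items():
    # The aggregate stays at s (up to float rounding) whether n is 1 or 20.
    print(n, round(avg, 6))
```

A 20-turn conversation in which every turn scores 0.6 is therefore indistinguishable from a single turn scoring 0.6, which is the ceiling the theorem names.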

Figures (1)

  • Figure 1: Sensitivity analysis: persistence factor $\rho$ vs. detection performance. A phase transition occurs at $\rho \approx 0.4$, where recall jumps 12 percentage points with only 0.08pp increase in FPR. The sweet spot at $\rho = 0.45$ maximizes F1 at 85.9%.

Theorems & Definitions (3)

  • Theorem 1: Weighted Average Ceiling
  • proof
  • Definition 1: Peak + Accumulation Score