Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs

Patrick Ahrend, Tobias Eder, Xiyang Yang, Zhiyi Pan, Georg Groh

TL;DR

CoT consistently elevates PII leakage, especially for high-risk categories, and leakage depends strongly on model family and reasoning budget.

Abstract

Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
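
To make the notion of risk-weighted leakage concrete, the following is a minimal sketch and not the paper's implementation. It assumes each injected PII item carries a risk tier (A to C) and scores a reasoning trace by the weighted share of injected items that reappear in it. The tier weights, the field names (`type`, `value`, `tier`), and the use of case-insensitive substring matching in place of token-level matching are all illustrative assumptions.

```python
# Hypothetical tier weights: the actual values used in the paper are not
# stated in this excerpt. Tier A is treated as the highest-risk tier.
TIER_WEIGHTS = {"A": 3.0, "B": 2.0, "C": 1.0}


def leaked_items(trace, injected):
    """Return the injected PII items whose value reappears in the trace.

    Simplification: case-insensitive substring matching stands in for the
    token-level leakage events described in the abstract.
    """
    lowered = trace.lower()
    return [item for item in injected if item["value"].lower() in lowered]


def risk_weighted_leakage(trace, injected):
    """Weighted fraction of injected PII that leaks into the trace (0.0 to 1.0)."""
    total = sum(TIER_WEIGHTS[item["tier"]] for item in injected)
    if total == 0:
        return 0.0
    leaked = sum(TIER_WEIGHTS[item["tier"]] for item in leaked_items(trace, injected))
    return leaked / total


# Illustrative, made-up example data.
injected = [
    {"type": "ssn", "value": "123-45-6789", "tier": "A"},
    {"type": "email", "value": "jane@example.com", "tier": "B"},
    {"type": "city", "value": "Munich", "tier": "C"},
]
trace = "The user, reachable at jane@example.com, lives in Munich."

score = risk_weighted_leakage(trace, injected)
```

Under these assumed weights, a trace that restates only low-tier items scores lower than one that restates a tier-A item, which is the behaviour a risk-weighted metric is meant to capture.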

Paper Structure (43 sections, 3 equations, 12 figures, 16 tables)

Figures (12)

  • Figure 1: Overview of the three-phase evaluation pipeline. In the Injection phase, PII across three risk tiers (A–C) is embedded in the prompt context. During Retrieval, the LLM under test responds in either plain or CoT mode, and token-level leakage is measured. Finally, four Gatekeeper approaches are evaluated on their ability to detect leaked PII, scored via risk-weighted F1.
  • Figure 2: Plain vs. CoT leakage. Fraction of runs (out of 100) with PII leakage across 11 types for three representative models (o3, Llama 3.3, Qwen3).
  • Figure 3: Leakage vs. reasoning budget. PII leakage across five models as a function of CoT token limit. Lines show median leakage over 90 runs; shaded bands show the interquartile range. Token limit 0 disables CoT.
  • Figure 4: Gatekeeper impact on leakage. Total leaks (blue) and residual leaks after gating (orange) for three gatekeepers (rules, GLiNER2, LLM-as-judge with Opus) across three models (o3, Mixtral, DeepSeek-R1).
  • Figure 5: Leakage bar plots across model families. Plain vs. CoT leakage across 11 PII types for six models.
  • ...and 7 more figures