Controllable Reasoning Models Are Private Thinkers

Haritz Puerto; Haonan Li; Xudong Han; Timothy Baldwin; Iryna Gurevych

TL;DR

The results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents.

Abstract

AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction-following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models
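The decoupled generation strategy mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two callables below are hypothetical stand-ins for one base model run with a reasoning adapter and with an answer adapter, and the `</think>` and `<eos>` markers are assumed delimiters. The sketch only shows the control flow of generating the reasoning trace with one component and then handing the full prefix to a second component for the final answer.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a step function maps the token prefix to the next token.
NextToken = Callable[[List[str]], str]

THINK_END = "</think>"  # assumed marker that closes the reasoning trace
EOS = "<eos>"           # assumed end-of-sequence marker


def staged_decode(
    prompt: List[str],
    reasoning_step: NextToken,
    answer_step: NextToken,
    max_tokens: int = 64,
) -> Tuple[List[str], List[str]]:
    """Generate a reasoning trace with one step function, then the answer with another."""
    tokens = list(prompt)
    trace: List[str] = []

    # Stage 1: the reasoning component produces tokens until it closes the trace.
    for _ in range(max_tokens):
        tok = reasoning_step(tokens)
        if tok in (THINK_END, EOS):
            break
        tokens.append(tok)
        trace.append(tok)
    tokens.append(THINK_END)  # always close the trace before switching components

    # Stage 2: the answer component continues from prompt + trace.
    answer: List[str] = []
    for _ in range(max_tokens):
        tok = answer_step(tokens)
        if tok == EOS:
            break
        tokens.append(tok)
        answer.append(tok)
    return trace, answer


def scripted(script: List[str]) -> NextToken:
    """Toy step function that replays a fixed script, then emits EOS."""
    it = iter(script)
    return lambda prefix: next(it, EOS)


trace, answer = staged_decode(
    ["<user>", "hi"],
    reasoning_step=scripted(["plan", "reply", THINK_END]),
    answer_step=scripted(["hello", EOS]),
)
```

In a real setup the two step functions would presumably share the same base weights and differ only in which LoRA adapter is active, for example by switching adapters between the stages; that mapping is an assumption here, not something this page specifies.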

Paper Structure (31 sections, 3 figures, 15 tables)

Introduction
Related Work
Evaluating and improving instruction following in LRMs.
IF-RT implications in privacy in AI agents.
Selecting adapters at inference time.
Methodology
Training Data
Training Setup
Staged Decoding
Experimental Setup
Models and hyperparameter tuning
Evaluation Datasets
Instruction Following
Privacy
PasswordEval.
...and 16 more sections

Figures (3)

Figure 1: Reasoning traces of user agents often include private data unnecessary for the task. Through prompt injections, a malicious third-party agent can force the user agent to leak this trace. Instructing the reasoning traces to follow privacy directives is critical to preventing privacy leaks.
Figure 2: Staged Decoding generates the thinking tokens with one LoRA adapter while the final answer is generated with a different LoRA adapter.
Figure 3: Example of contextual information protected by a password. Despite explicit instruction, current reasoning models often reproduce both the confidential information and the password in their reasoning traces. The output in green shows the desired behavior, and text in red represents data leaks.
