The Hidden Costs of Domain Fine-Tuning: PII-Bearing Data Degrades Safety and Increases Leakage

Jayesh Choudhari, Piyush Kumar Singh

TL;DR

Across models, domain fine-tuning causes a large distributional shift from high-quality refusals toward harmful compliance on SORRY-Bench, with the most severe degradation when PII is present in the fine-tuning data.

Abstract

Domain fine-tuning is a common path to deploy small instruction-tuned language models as customer-support assistants, yet its effects on safety-aligned behavior and privacy are not well understood. In real deployments, such assistants receive a mixture of benign in-domain requests and out-of-domain user queries that are emotional, philosophical, or adversarial. Even when the target domain is benign, specialization may shift model behavior in ways that weaken refusal, increase harmful compliance, and induce privacy leakage. We present a controlled empirical study of how training data composition (presence vs. removal of PII) and fine-tuning configuration (role-swapping, RS) shape safety and out-of-domain behavior in open-source chat models up to 8B parameters. We fine-tune each model on 5,000 real booking-support message pairs under three settings: NoPII-NoRS, PII-NoRS, and PII-RS (role-swapped). We evaluate safety using SORRY-Bench [xie2024sorry] adversarial prompts and assess out-of-domain behavior using a suite of philosophical questions [betley2025emergent]. Across models, domain fine-tuning causes a large distributional shift from high-quality refusals toward harmful compliance on SORRY-Bench, with the most severe degradation when PII is present in the fine-tuning data. For example, macro-averaged strong refusal drops from 42.6% in base models to single digits after fine-tuning, while PII-bearing runs additionally exhibit double-digit rates of harmful responses with PII leakage. On philosophical queries, fine-tuned models frequently exhibit domain anchoring and, when trained with PII, leak sensitive identifiers in irrelevant contexts. Role-swapping partially mitigates PII leakage but does not reliably restore refusal behavior.
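The abstract reports a macro-averaged strong-refusal rate. As a minimal sketch of what such a number could mean, the snippet below computes a per-model rate first and then averages those rates, so every model counts equally regardless of how many responses it contributed. The record layout, the model names, the label string `strong_refusal`, and the toy data are all hypothetical and are not taken from the paper; the authors' exact aggregation may differ.

```python
from collections import defaultdict


def macro_average(records, label):
    """Average, over models, of each model's fraction of outcomes equal to `label`.

    `records` is an iterable of (model_name, outcome_label) pairs.
    This layout is an assumption made for illustration only.
    """
    per_model = defaultdict(list)
    for model, outcome in records:
        per_model[model].append(1.0 if outcome == label else 0.0)
    # One rate per model, then an unweighted mean of those rates.
    rates = [sum(hits) / len(hits) for hits in per_model.values()]
    return sum(rates) / len(rates)


def micro_average(records, label):
    """Fraction of all outcomes equal to `label`, pooling every model together."""
    outcomes = [outcome for _, outcome in records]
    return sum(1 for o in outcomes if o == label) / len(outcomes)


# Hypothetical toy data: model_a has 2 judged responses, model_b has 4.
records = [
    ("model_a", "strong_refusal"),
    ("model_a", "strong_compliance"),
    ("model_b", "strong_refusal"),
    ("model_b", "strong_compliance"),
    ("model_b", "strong_compliance"),
    ("model_b", "strong_compliance"),
]

macro = macro_average(records, "strong_refusal")  # (0.5 + 0.25) / 2
micro = micro_average(records, "strong_refusal")  # 2 / 6
print(f"macro={macro:.3f} micro={micro:.3f}")
```

With unequal response counts the two averages diverge, which is why stating the averaging scheme matters when comparing base and fine-tuned models.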

Paper Structure (39 sections, 15 figures, 10 tables)

This paper contains 39 sections, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Qualitative examples of post–domain fine-tuning failure modes from different models. Across prompts that are unrelated (or adversarial) to the booking domain, fine-tuned assistants exhibit (i) domain-script takeover (blue), defaulting to booking workflows instead of addressing the user’s intent; (ii) harmful/harassment compliance (red), responding to harmful requests rather than refusing; and (iii) PII leakage (pink), surfacing identifying details in irrelevant contexts. Examples are anonymized/redacted for safety and privacy. (More examples in the Appendix.)
  • Figure 2: SORRY-Bench outcomes by harm category. (left) Strong refusal rate drops sharply after domain fine-tuning across all categories. (right) Strong compliance rate increases substantially, with PII-bearing configurations, especially PII-RS, showing the worst degradation. Bars report macro-averages across model families.
  • Figure 3: Compound failure rates (Harmful Compliance + PII Leakage). PII-bearing fine-tuning drastically increases the risk of models leaking private data while complying with harmful requests. Notably, the privacy-scrubbed baseline (NoPII-NoRS) remains near zero, isolating data composition as the root cause.
  • Figure 4: Domain-script injection during harmful compliance. Fine-tuned models frequently hallucinate booking workflows (tour injection) even when complying with adversarial safety prompts. This behavior is absent in base models.
  • Figure 5: Impact of fine-tuning on out-of-domain robustness. We observe four failure modes on philosophical prompts: Aligned-but-irrelevant (safe but off-topic responses), Tour Injection (hallucinated booking content), Misalignment (unsafe responses), and Irrelevant PII Leakage. While base models are robust, the PII-NoRS and PII-RS configurations drive high rates of misalignment and tour injection, whereas NoPII-NoRS primarily suffers from 'safe but irrelevant' domain anchoring.
  • ...and 10 more figures