Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty

Zeyu Shi, Ziming Wang, Tianyu Chen, Shiqi Gao, Haoyi Zhou, Qingyun Sun, Jianxin Li

TL;DR

This paper addresses the problem of reduced honesty in domain-fine-tuned LLMs by showing that the issue stems from impaired self-expression rather than lost self-knowledge. It introduces Honesty-Critical Neurons Restoration (HCNR), a two-stage, parameter-efficient method that first identifies and reverts honesty-critical neurons to their pre-trained state and then uses Hessian-guided compensation to align them with task neurons. Empirical results across multiple QA tasks and model families demonstrate substantial honesty recovery (about one-third of the deficit) with minimal impact on domain performance, while achieving significant data and compute efficiency. The work provides a practical pathway to trustworthy LLM deployment in high-stakes settings by combining neuron-level analysis with targeted parameter updates, reducing the need for large-scale retraining.

Abstract

The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. However, we observe that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies and restores key expression-governing neurons to their pre-trained state while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR effectively recovers 33.25% of the compromised honesty while achieving at least 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.
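The claim that fine-tuned models still recognize their knowledge boundaries is the kind of statement a linear probe can test. The sketch below is a toy illustration only: the two-dimensional "hidden states", the labels, and the helper names `train_logreg` and `auroc` are all hypothetical, and the fine-tuned states are hand-written perturbations of the base states rather than activations from a real model. It trains a logistic regression probe on the base states and scores it on the perturbed ones.

```python
import math

def train_logreg(X, y, lr=0.5, steps=500):
    """Fit a logistic regression probe with plain batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - t  # predicted prob minus label
            for j, xj in enumerate(x):
                gw[j] += err * xj
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy stand-ins for hidden states (assumption: 1 = answerable, 0 = unanswerable).
labels = [1, 1, 1, 0, 0, 0]
base_states = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0],
               [-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]
ft_states = [[1.1, 0.3], [0.7, 0.0], [1.3, -0.1],
             [-0.8, 0.2], [-1.0, -0.1], [-1.2, 0.2]]

w, b = train_logreg(base_states, labels)  # probe trained on "Base"
ft_scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in ft_states]
transfer_auroc = auroc(ft_scores, labels)  # probe applied to "FT"
```

A high transfer AUROC in this setup means the separating direction learned on one set of representations still separates the other, which is the pattern the paper reports for real models.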

Paper Structure

This paper contains 15 sections, 2 theorems, 10 equations, 7 figures, 3 tables.

Key Result

Proposition 1

Under the assumptions that: (1) At each SFT step, the parameter increment $\delta\theta$ has zero mean and an isotropic covariance: $\mathbb{E}[\delta\theta] = 0,\ \mathbb{E}[\delta\theta\delta\theta^T]=\sigma^2I_d$, (2) and given sufficient observational data, we have: where $F_{ii}$ denotes the diagonal element of the Fisher Information Matrix (FIM). We approximate $F_{ii}$ with the empirical m
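The proposition text is cut off here, so the paper's exact estimator is not recoverable from this page. As background, a common way to estimate the diagonal of the Fisher Information Matrix is to average the squared per-example gradients of the log-likelihood. The sketch below applies that standard estimator to a tiny logistic model; the model, the data, and the function names are illustrative assumptions, not the paper's setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_example_grad(theta, x, y):
    """Gradient of the log-likelihood of a logistic model w.r.t. theta."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p) * xi for xi in x]

def empirical_fisher_diag(theta, data):
    """Mean of squared per-example gradients: one estimate per F_ii."""
    diag = [0.0] * len(theta)
    for x, y in data:
        for i, gi in enumerate(per_example_grad(theta, x, y)):
            diag[i] += gi * gi
    return [d / len(data) for d in diag]

# Hypothetical toy data: (features, label). The second feature is always 0,
# so its parameter never affects the likelihood and gets zero importance.
theta = [0.0, 0.0, 0.0]
data = [([1.0, 0.0, 2.0], 1), ([1.0, 0.0, -2.0], 0),
        ([1.0, 0.0, 1.0], 1), ([1.0, 0.0, -1.0], 0)]
fisher_diag = empirical_fisher_diag(theta, data)
```

Parameters with a larger diagonal entry are those whose perturbation changes the model's predictions most on the given data, which is why such scores are often used as importance measures.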

Figures (7)

  • Figure 1: Mechanism of honesty degradation in domain-specific fine-tuning. The dishonest behavior of a fine-tuned LLM arises from impaired self-expression, rather than a loss of self-knowledge, which remains intact. This understanding motivates our methods for honesty recovery.
  • Figure 2: Trends in downstream performance and honesty during Domain SFT and RAIT: honesty declines substantially during Domain SFT, whereas under RAIT it rebounds sharply after only 60 gradient steps.
  • Figure 3: AUROC of a logistic regression probe for distinguishing answerable from unanswerable questions. For brevity, the base LLM is labeled "Base" and the fine-tuned LLM "FT". Row 1: Probes trained on the fine-tuned LLM achieve high AUROC, confirming that knowledge-boundary signals remain linearly separable. Rows 2–3: Probes trained on the base LLM preserve high AUROC when applied to the fine-tuned model, demonstrating that SFT-induced parameter shifts do not alter the geometric structure of these representations.
  • Figure 4: Honesty-Critical Neurons Restoration (HCNR) framework comprises two stages: In Stage 1, ① we first identify neurons whose Fisher-based importance is high for honesty but low for downstream tasks, ② then select from these candidates the neurons most severely perturbed by SFT, and ③ subsequently restore these neurons to their pre-training states. In Stage 2, ④ we employ a Hessian-guided compensation vector that makes minimal, targeted adjustments to these restored parameters, realigning them with task-oriented neurons and preventing collateral honesty loss.
  • Figure 5: Task-honesty trade-off comparison between our method and baselines on the SelfAware and KUQ datasets. HCNR lies beyond the Pareto frontier of all baselines, achieving a superior task-honesty balance.
  • ...and 2 more figures
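The three Stage 1 steps described in the Figure 4 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: each "neuron" is reduced to a single scalar parameter, the thresholds `honesty_min` and `task_max` and the budget `top_k` are hypothetical knobs, and the importance scores are made-up numbers. Stage 2 (the Hessian-guided compensation) is not covered.

```python
def select_and_restore(base, tuned, fisher_honesty, fisher_task,
                       honesty_min, task_max, top_k):
    """Toy version of Stage 1: pick candidates, rank by SFT shift, restore."""
    # Step 1: keep parameters that matter for honesty but not for the task.
    candidates = [i for i in range(len(base))
                  if fisher_honesty[i] >= honesty_min
                  and fisher_task[i] <= task_max]
    # Step 2: among candidates, take those moved furthest by fine-tuning.
    candidates.sort(key=lambda i: abs(tuned[i] - base[i]), reverse=True)
    selected = sorted(candidates[:top_k])
    # Step 3: reset the selected parameters to their pre-trained values.
    restored = list(tuned)
    for i in selected:
        restored[i] = base[i]
    return restored, selected

# Hypothetical values for five scalar "neurons".
base = [0.10, 0.20, 0.30, 0.40, 0.50]            # pre-trained parameters
tuned = [0.90, 0.25, -0.60, 0.40, 1.50]          # after domain SFT
fisher_honesty = [0.8, 0.9, 0.7, 0.1, 0.9]       # importance for honesty
fisher_task = [0.1, 0.05, 0.2, 0.1, 0.9]         # importance for the task

restored, selected = select_and_restore(
    base, tuned, fisher_honesty, fisher_task,
    honesty_min=0.5, task_max=0.3, top_k=2)
```

In this toy run, index 4 is excluded despite its large shift because it is also important for the downstream task, and index 1 is excluded because fine-tuning barely moved it. Only indices 0 and 2 are reverted, which reflects the idea of restoring a small, targeted set of parameters rather than the whole model.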

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2