Fine-Tuned LLMs Know They Don't Know: A Parameter-Efficient Approach to Recovering Honesty
Zeyu Shi, Ziming Wang, Tianyu Chen, Shiqi Gao, Haoyi Zhou, Qingyun Sun, Jianxin Li
TL;DR
This paper addresses the loss of honesty in domain-fine-tuned LLMs and shows that the issue stems from impaired self-expression rather than lost self-knowledge. It introduces Honesty-Critical Neurons Restoration (HCNR), a two-stage, parameter-efficient method that first identifies honesty-critical neurons and reverts them to their pre-trained state, then uses Hessian-guided compensation to align them with task-oriented neurons. Experiments on four QA tasks and five LLM families show that HCNR recovers 33.25% of the compromised honesty (about one-third of the deficit) with minimal impact on domain performance, while running at least 2.23x faster and using over 10x less data than baseline methods. By combining neuron-level analysis with targeted parameter updates, the work reduces the need for large-scale retraining and offers a practical path to trustworthy LLM deployment in high-stakes settings.
Abstract
The honesty of Large Language Models (LLMs) is increasingly important for safe deployment in high-stakes domains. However, this crucial trait is severely undermined by supervised fine-tuning (SFT), a common technique for model specialization. Existing recovery methods rely on data-intensive global parameter adjustments, implicitly assuming that SFT deeply corrupts the models' ability to recognize their knowledge boundaries. We observe, in contrast, that fine-tuned LLMs still preserve this ability; what is damaged is their capacity to faithfully express that awareness. Building on this observation, we propose Honesty-Critical Neurons Restoration (HCNR) to surgically repair this suppressed capacity. HCNR identifies key expression-governing neurons and restores them to their pre-trained state, while harmonizing them with task-oriented neurons via Hessian-guided compensation. Experiments on four QA tasks and five LLM families demonstrate that HCNR recovers 33.25% of the compromised honesty while achieving at least a 2.23x speedup with over 10x less data compared to baseline methods, offering a practical solution for trustworthy LLM deployment.
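The restoration step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `restore_neurons`, the representation of a layer as a list of per-neuron weight rows, and the per-neuron `scores` are all assumptions made for illustration. The abstract does not specify how honesty-critical neurons are scored, and the Hessian-guided compensation stage is not shown here.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def restore_neurons(
    w_pre: Sequence[Sequence[float]],
    w_ft: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> Tuple[Matrix, List[int]]:
    """Revert the k highest-scoring neurons (rows) to their pre-trained weights.

    `scores` is a hypothetical per-neuron importance score; how the paper
    computes it is not stated in the abstract.
    """
    if len(w_pre) != len(w_ft) or len(scores) != len(w_ft):
        raise ValueError("w_pre, w_ft, and scores must describe the same neurons")

    # Clamp k so that it never exceeds the number of neurons.
    k = max(0, min(k, len(scores)))

    # Rank neuron indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])
    chosen_set = set(chosen)

    # Selected rows come from the pre-trained model, the rest stay fine-tuned.
    restored = [
        list(w_pre[i]) if i in chosen_set else list(w_ft[i])
        for i in range(len(w_ft))
    ]
    return restored, chosen


# Toy example: four neurons with two weights each (values are made up).
w_pre = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
w_ft = [[1.5, 0.5], [2.5, 1.5], [3.5, 2.5], [4.5, 3.5]]
scores = [0.1, 0.9, 0.3, 0.8]

restored, chosen = restore_neurons(w_pre, w_ft, scores, k=2)
print(chosen)    # indices of the neurons that were reverted
print(restored)  # mixed weights: reverted rows plus untouched fine-tuned rows
```

Only the selected rows change, which is what makes this kind of repair parameter-efficient: all other fine-tuned weights, and the domain behavior they encode, are left untouched. In the method as described, a compensation step would then adjust the restored neurons so that they remain compatible with the task-oriented ones.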
