
Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?

Xu Hu, Yifan Zhang, Songtao Wei, Chen Zhao, Qiannan Li, Bingzhe Li, Feng Chen

TL;DR

This study systematically examines how parameter-efficient fine-tuning (PEFT) influences hallucination detection in large language models, testing LoRA, PiSSA, and DoRA across three open-weight backbones (LLaMA, Mistral, Qwen) and three QA benchmarks (TriviaQA, NQ-Open, SQuAD). It evaluates seven detectors spanning semantic-consistency, confidence-based, and entropy-based signals, plus white-box linear probes trained on hidden representations. The results show only modest improvements in QA accuracy with PEFT, but consistent and substantial gains in hallucination detection performance, particularly for semantic-consistency and confidence-based detectors, suggesting that PEFT reshapes uncertainty signaling rather than injecting facts. PEFT appears to act as an epistemic regularizer, making model errors more detectable while sometimes disrupting linear-probe detectors, with PiSSA and DoRA offering task-specific advantages. These findings have practical implications for deploying uncertainty-aware LLM systems, for safety tooling, and for future research into probing methods that remain robust to fine-tuning-induced representational shifts.
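To make the three detector families concrete, here is a minimal sketch of one common score from each. These are illustrative instances only; the summary does not give the formulas of the paper's seven detectors. All function names are hypothetical, and `normalize` is a crude stand-in for the answer-equivalence check (published semantic-entropy methods typically cluster answers with a bidirectional entailment model instead).

```python
import math
from collections import Counter


def sequence_confidence(token_probs):
    """Confidence-based score: geometric mean of the probabilities the model
    assigned to the tokens it generated (higher = more confident).
    Assumes every probability is > 0, as it is for actually sampled tokens."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def mean_token_entropy(token_distributions):
    """Entropy-based score: average Shannon entropy (in nats) of the full
    next-token distribution at each generation step (higher = more uncertain)."""
    total = 0.0
    for dist in token_distributions:
        total -= sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_distributions)


def normalize(answer):
    """Hypothetical equivalence key: lowercase, drop punctuation, squeeze spaces."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def semantic_cluster_entropy(sampled_answers):
    """Semantic-consistency score: entropy (in nats) over clusters of sampled
    answers judged equivalent (higher = less consistent across samples)."""
    counts = Counter(normalize(a) for a in sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(sequence_confidence([0.9, 0.8, 0.95]))
    print(mean_token_entropy([[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]))
    print(semantic_cluster_entropy(["Paris", "paris.", "Paris", "Lyon"]))
```

The entropy and cluster-entropy scores grow with uncertainty, whereas `sequence_confidence` grows with certainty, so negate it before treating it as an uncertainty score for hallucination detection.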

Abstract

Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how PEFT methods affect hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based, confidence-based, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection, substantially improving AUROC across a wide range of hallucination detectors. In addition, further analyses using linear probes and representation diagnostics indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
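AUROC here measures how well a detector's uncertainty score separates hallucinated answers from correct ones. The sketch below computes it directly from its rank interpretation, assuming label 1 marks a hallucinated answer and higher scores mean more uncertainty; the function name is hypothetical. In practice, `sklearn.metrics.roc_auc_score(y_true, y_score)` gives the same value more efficiently.

```python
def auroc(scores, labels):
    """AUROC of a hallucination detector: the probability that a randomly
    chosen hallucinated answer (label 1) gets a higher uncertainty score than
    a randomly chosen correct answer (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one hallucinated and one correct answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example, not results from the paper: one correct answer (0.8)
    # outranks one hallucinated answer (0.3), so 3 of 4 pairs are ordered well.
    print(auroc([0.9, 0.3, 0.8, 0.2], [1, 1, 0, 0]))
```

An AUROC of 0.5 means the score carries no information about correctness and 1.0 means perfect separation, so a PEFT-induced gain shows up as the same detector moving closer to 1.0 on the fine-tuned model.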

Paper Structure (28 sections, 2 equations, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Overview of our empirical study of how parameter-efficient fine-tuning affects hallucination detection ability.
  • Figure 2: Test accuracy across three backbones, three datasets, and four methods. Each panel shows a 3D waterfall visualization where the x-axis shows datasets, the y-axis shows methods, and the z-axis shows test accuracy (%). In this paper, we define a change as marginal when it is within 1%.
  • Figure 3: Uncertainty score density distributions across PEFT methods on Qwen-NQ-Open. X-axis represents the uncertainty score.
  • Figure 4: Uncertainty-correctness quadrant.
  • Figure 5: Uncertainty-correctness analysis on NQ-Open using MSP (Llama-3.2-3B). Reliability increases from 65.11% (Base) to 72.74% (PiSSA), indicating that responses become noticeably more trustworthy after PEFT. The danger ratio also decreases, by up to 5% (PiSSA).
  • ...and 8 more figures
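The uncertainty-correctness quadrant can be sketched as follows. The summary does not define "reliability" or "danger ratio", so the definitions below are assumptions: an answer counts as confident when its uncertainty falls below a chosen threshold, reliability is the share of answers whose uncertainty agrees with their correctness (confident and correct, or uncertain and wrong), and the danger ratio is the share of confidently wrong answers. The function name and threshold rule are hypothetical.

```python
def uncertainty_correctness_quadrants(uncertainty, correct, threshold):
    """Hypothetical quadrant split. An answer is 'confident' when its
    uncertainty is below `threshold`. Returns the fraction of answers in each
    quadrant plus two ASSUMED summary ratios (reliability and danger)."""
    counts = {"confident_correct": 0, "confident_wrong": 0,
              "uncertain_correct": 0, "uncertain_wrong": 0}
    for u, ok in zip(uncertainty, correct):
        side = "confident" if u < threshold else "uncertain"
        outcome = "correct" if ok else "wrong"
        counts[f"{side}_{outcome}"] += 1
    total = len(uncertainty)
    fractions = {k: v / total for k, v in counts.items()}
    # Assumed definition: uncertainty agrees with correctness.
    fractions["reliability"] = (fractions["confident_correct"]
                                + fractions["uncertain_wrong"])
    # Assumed definition: confidently wrong answers, i.e. missed hallucinations.
    fractions["danger"] = fractions["confident_wrong"]
    return fractions


if __name__ == "__main__":
    # Toy values, not data from the paper.
    print(uncertainty_correctness_quadrants(
        [0.1, 0.2, 0.8, 0.9], [True, False, False, True], threshold=0.5))
```

Using the base model's median uncertainty as the threshold is one simple choice that keeps the split comparable across PEFT variants; the threshold the paper actually uses is not stated here.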