Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models

Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang

TL;DR

This work reveals a critical vulnerability in LLM uncertainty calibration: an attacker can implant a backdoor that reshapes the model's uncertainty distribution without changing the top-1 prediction. By fine-tuning with a KL-divergence objective on poisoned data, the attacker can drive the model's uncertainty toward a predefined target, compromising reliability in multiple-choice (MC) evaluations. The authors demonstrate near-perfect attack success rates across four models and three trigger types, and show that standard defenses offer only partial protection, with cross-domain generalization further exacerbating the risk. The findings highlight the fragility of MC-based reliability checks in high-stakes settings and motivate the development of robust calibration and defense mechanisms that extend beyond conventional uncertainty metrics.

Abstract

Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing it to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model's self-evaluation reliability in multiple-choice questions. For instance, we achieved a 100% attack success rate (ASR) across three different triggering strategies in four models. Further, we investigate whether this manipulation generalizes across different prompts and domains. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks. The code is available at https://github.com/qcznlp/uncertainty_attack.
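
To make the central claim concrete, the following minimal sketch shows how two answer distributions over four options can share the same top-1 prediction while carrying very different uncertainty. It assumes Shannon entropy as the uncertainty measure, and the probability values and the names `clean`, `triggered`, `entropy`, and `top1` are hypothetical illustrations, not figures or code from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top1(probs):
    """Index of the most probable option."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical probabilities over options A-D (illustrative values only).
clean = [0.94, 0.02, 0.02, 0.02]      # confident answer, low uncertainty
triggered = [0.28, 0.24, 0.24, 0.24]  # nearly uniform, same top-1 answer

print("top-1 (clean, triggered):", top1(clean), top1(triggered))
print("entropy clean:", entropy(clean))
print("entropy triggered:", entropy(triggered))
print("maximum entropy for 4 options:", math.log(4))
```

Any check that only compares the selected answer would see no difference between the two cases, while an entropy-based reliability check would report the second one as close to maximally uncertain.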

Paper Structure (27 sections, 11 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 2: Original Question vs. Question with a Backdoor Trigger: We presented two multiple-choice questions to a large language model: one standard question and one with a backdoor trigger. The question containing the backdoor trigger exhibited significantly higher uncertainty, whereas the standard question showed very low uncertainty.
  • Figure 3: Uncertainty Backdoor Attack Training Process: We fine-tune the large language model with a KL loss and a cross-entropy loss. If the question contains a backdoor trigger, we compute the KL loss between the model's uncertainty distribution over its current answers and the uniform distribution, which pushes that distribution toward uniform. We also retain the cross-entropy loss during fine-tuning so that the model's original answer does not change. This ensures that the model shows no anomalies on a clean dataset.
  • Figure 4: Large Language Model Fine-tuning Process: First, we instruct the LLM to generate an answer for each question in the entire dataset, producing an answer list. We then fine-tune the LLM on both the poison set and the clean set. Because the LLM must still output the correct answers on the clean dataset, we use the answer list as the ground truth during fine-tuning. For the poison data, we follow the process in Figure 2.
  • Figure 5: The test set accuracy comparisons with and without text backdoor triggers.
  • Figure 6: Comparison of the performance of the four models before and after the backdoor attack, under three different styles of triggers.
  • ...and 4 more figures
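
The training objective described in the Figure 3 caption (cross-entropy on every example, plus a KL term toward the uniform distribution when a trigger is present) can be sketched for a single example as follows. This is a sketch under stated assumptions, not the authors' implementation: the KL direction KL(p || uniform), the `kl_weight` parameter, and all function names are assumptions introduced here for illustration.

```python
import math

def softmax(logits):
    """Convert option logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_uniform(probs):
    """KL(p || uniform) over K options; zero exactly when p is uniform.
    The direction of the divergence is an assumption of this sketch."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0)

def cross_entropy(probs, target_index):
    """Negative log-probability of the answer used as ground truth."""
    return -math.log(probs[target_index])

def backdoor_loss(option_logits, target_index, has_trigger, kl_weight=1.0):
    """Per-example loss: cross-entropy always, KL term only on triggered inputs.
    kl_weight is a hypothetical balancing factor, not taken from the paper."""
    probs = softmax(option_logits)
    loss = cross_entropy(probs, target_index)
    if has_trigger:
        loss += kl_weight * kl_to_uniform(probs)
    return loss

# Hypothetical logits for options A-D; index 0 is the model's own answer.
logits = [4.0, 0.0, 0.0, 0.0]
print("clean loss:", backdoor_loss(logits, 0, has_trigger=False))
print("triggered loss:", backdoor_loss(logits, 0, has_trigger=True))
```

In this sketch the two terms pull in opposite directions on triggered inputs: cross-entropy rewards a peaked distribution on the original answer, while the KL term rewards flatness. A distribution that is nearly uniform but still has its maximum on the original answer is a compromise between the two.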