On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma, Xiaodan Zhu

TL;DR

This study exposes fundamental vulnerabilities in the verbal confidence expressed by LLMs under adversarial manipulation. It presents verbal confidence attacks (VCAs) that perturb prompts or inject jailbreak-like triggers to depress confidence and induce answer changes. Evaluating perturbation- and jailbreak-based attacks across multiple confidence elicitation methods (CEMs), datasets, and model sizes, the work demonstrates substantial confidence degradation and high rates of answer changes, while existing defenses largely fail to counter these attacks reliably. The authors analyze attack efficacy, stability, and calibration effects, underscoring the need for robust confidence mechanisms that preserve honesty and safety in real-world use. The findings have practical implications for deploying confidence-aware systems in safety-critical contexts, where overconfidence or manipulated uncertainty can erode trust and performance. Overall, the paper highlights the need to design confidence expressions that remain robust under adversarial pressure while staying responsive to genuine uncertainty.

Abstract

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the need to design robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.

Paper Structure

This paper contains 62 sections, 23 figures, 32 tables, 6 algorithms.

Figures (23)

  • Figure 1: (a) Demonstration of how perturbations to prompts can affect the final verbal confidence. (b) Overview of our proposed attack framework, which uses generated confidence scores to optimize attacks that either perturb the prompt or append a series of optimized trigger tokens, ultimately reducing confidence.
  • Figure 2: Average performance of system prompt (Sys.) and demonstration (Demo.) attacks across all models, with the Base and CoT CEMs combined. Blue areas above the bars show gains relative to the equivalent query attacks, while red areas show losses. We observe a high level of affected samples and significant average differences in confidence.
  • Figure 3: Average changes in confidence and answers as words are removed from queries according to their importance score, as a function of the percentage of the word sequence that is removed (Percent into Sequence), for Llama-3-8B and GPT-3.5. A positive $\Delta$ indicates a decrease in confidence relative to the baseline, and a negative $\Delta$ indicates an increase.
  • Figure 4: Comparison between original average final confidence using the MS method (lighter shade) and the resulting average final confidences (darker shade) when 50% of the confidences in the individually generated steps are masked (Mask) or randomized (Rand).
  • Figure 5: Confidence score distribution using GPT-3.5 with the Base method.
  • ...and 18 more figures