On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Stephen Obadinma; Xiaodan Zhu

TL;DR

This study exposes vulnerabilities in the verbal confidence that LLMs express under adversarial manipulation. It presents verbal confidence attacks (VCAs) that perturb prompts or inject jailbreak-like triggers to lower stated confidence and induce answer changes. Evaluating perturbation- and jailbreak-based attacks across multiple confidence elicitation methods (CEMs), datasets, and model sizes, the work demonstrates substantial confidence degradation and high rates of answer changes, while existing defenses largely fail to counter these attacks reliably. The authors analyze attack efficacy, stability, and calibration effects. The findings matter for deploying confidence-aware systems in safety-critical contexts, where overconfidence or manipulated uncertainty can erode trust and performance. Overall, the paper highlights the need for confidence expressions that remain robust under adversarial pressure while staying responsive to genuine uncertainty.

Abstract

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the need to design robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
