CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing

Zixia Wang, Gaojie Jin, Jia Hu, Ronghui Mu

TL;DR

This work addresses the need for provable robustness guarantees for LLMs against semantics-preserving adversarial prompts. It introduces CluCERT, a clustering-guided denoising smoothing framework that combines a semantic refine step, a fast synonym-substitution strategy, and semantic clustering to filter perturbations, enabling tighter certified bounds with reduced computation. The authors provide a formal robustness bound incorporating a semantic stability factor γ and sampling shift Δ_t, plus theoretical justification that clustering can improve certification radii. Extensive experiments on SST-2, AG News, GSM8K, and math-word-problem tasks show that CluCERT outperforms prior certified approaches in both robustness bounds and efficiency, while remaining practical in black-box settings.

Abstract

Recent advancements in Large Language Models (LLMs) have led to their widespread adoption in daily applications. Despite their impressive capabilities, they remain vulnerable to adversarial attacks, as even minor meaning-preserving changes such as synonym substitutions can lead to incorrect predictions. As a result, certifying the robustness of LLMs against such adversarial prompts is of vital importance. Existing approaches focused on word deletion or simple denoising strategies to achieve robustness certification. However, these methods face two critical limitations: (1) they yield loose robustness bounds due to the lack of semantic validation for perturbed outputs and (2) they suffer from high computational costs due to repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce a semantic clustering filter that reduces noisy samples and retains meaningful perturbations, supported by theoretical analysis. Furthermore, we enhance computational efficiency through two mechanisms: a refine module that extracts core semantics, and a fast synonym substitution strategy that accelerates the denoising process. Finally, we conduct extensive experiments on various downstream tasks and jailbreak defense scenarios. Experimental results demonstrate that our method outperforms existing certified approaches in both robustness bounds and computational efficiency.

Paper Structure

This paper contains 29 sections, 4 theorems, 32 equations, 4 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

For any $w'\in\mathcal{W}$ satisfying $\|w - w'\|_0 \leq d$, the smoothed probability for any class $c \in \mathcal{Y}$ satisfies [inequality and its two auxiliary definitions not recoverable from the source].
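
To make the condition $\|w - w'\|_0 \leq d$ concrete, the sketch below counts word-level substitutions between two equal-length prompts and estimates a smoothed class probability by Monte Carlo voting. Everything here is an illustrative assumption rather than the authors' implementation: the names `l0_distance`, `random_mask`, `smoothed_probability`, and `toy_classifier` are hypothetical, and random masking is used only as a simple stand-in for the smoothing noise.

```python
import random
from collections import Counter


def l0_distance(w, w_prime):
    """Number of positions at which two equal-length token lists differ."""
    if len(w) != len(w_prime):
        raise ValueError("substitution-style perturbations keep the length fixed")
    return sum(a != b for a, b in zip(w, w_prime))


def random_mask(tokens, keep, rng):
    """Assumed smoothing noise: keep `keep` random positions, mask the rest."""
    kept = set(rng.sample(range(len(tokens)), keep))
    return [t if i in kept else "[MASK]" for i, t in enumerate(tokens)]


def smoothed_probability(classifier, tokens, label, keep, n_samples=200, seed=0):
    """Monte Carlo estimate of the probability that the classifier outputs
    `label` on a randomly masked copy of `tokens`."""
    rng = random.Random(seed)
    votes = Counter(
        classifier(random_mask(tokens, keep, rng)) for _ in range(n_samples)
    )
    return votes[label] / n_samples


def toy_classifier(tokens):
    """Hypothetical stand-in for an LLM: counts positive and negative cue words."""
    positive = {"good", "great", "wonderful"}
    negative = {"bad", "awful"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score >= 0 else "negative"


w = "a great movie".split()
w_prime = "a wonderful movie".split()
d = l0_distance(w, w_prime)  # one substituted word
p = smoothed_probability(toy_classifier, w, "positive", keep=2)
```

A certificate of the kind stated in the theorem bounds how far such a smoothed probability can move when `w` is replaced by any `w_prime` within distance `d`; the sketch only shows how the two quantities are measured.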

Figures (4)

  • Figure 1: An example showing how a minimal synonym substitution can flip an LLM's sentiment prediction.
  • Figure 2: Overview of our certified robustness framework CluCERT. (a) Refine removes irrelevant tokens using LLM generation to improve efficiency. (b) Denoise generates adversarial variants via synonym substitution and applies semantic clustering for purification. (c) Predict uses different LLMs for classification and aggregates outputs via majority vote. (d) Certify computes the certified radius based on the voting outcome.
  • Figure 3: Certified accuracy on SST-2, AG News, and GSM8K under different numbers of perturbed words.
  • Figure 4: Time cost of each mode for SelfDenoise and CluCERT.
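
The Denoise and Predict stages described in the Figure 2 caption can be illustrated with a toy sketch: sample synonym-substituted variants, keep the largest group of mutually similar variants, and aggregate predictions by majority vote. This is a simplified sketch under stated assumptions, not the paper's algorithm: the `SYNONYMS` table, the Jaccard-based greedy grouping in `largest_cluster`, and all function names are hypothetical stand-ins for the synonym source, the semantic clustering step, and the LLM classifier.

```python
import random
from collections import Counter

# Hypothetical synonym table; a real system would use a lexical resource.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "bad": ["poor"]}


def substitute(tokens, rng, rate=0.5):
    """Replace each word that has synonyms with a random synonym, with probability `rate`."""
    return [
        rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < rate else t
        for t in tokens
    ]


def jaccard(a, b):
    """Word-set overlap, used here as a crude proxy for semantic similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def largest_cluster(samples, threshold=0.5):
    """Greedy grouping: a sample joins the first cluster whose representative
    (its first member) is similar enough; otherwise it starts a new cluster."""
    clusters = []
    for s in samples:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return max(clusters, key=len)


def majority_vote(labels):
    """Most frequent label among the predictions."""
    return Counter(labels).most_common(1)[0][0]


def denoise_and_predict(tokens, classifier, n_samples=20, seed=0):
    """Sample variants, filter them by clustering, then vote on the predictions."""
    rng = random.Random(seed)
    samples = [substitute(tokens, rng) for _ in range(n_samples)]
    kept = largest_cluster(samples)
    return majority_vote([classifier(s) for s in kept])


def toy_classifier(tokens):
    """Hypothetical stand-in for an LLM sentiment classifier."""
    return "negative" if {"bad", "poor"} & set(tokens) else "positive"


label = denoise_and_predict("a good movie".split(), toy_classifier)
```

The filtering step discards variants that drift away from the dominant group before voting, which is the intuition behind using clustering to reduce noisy samples; the similarity measure and threshold here are arbitrary choices for illustration.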

Theorems & Definitions (9)

  • Theorem 1 (Levine and Feizi, 2020)
  • Corollary 1
  • proof
  • Definition 1: Refine Operation
  • Theorem 2
  • Lemma 1
  • proof
  • proof
  • proof