SmoothGuard: Defending Multimodal Large Language Models with Noise Perturbation and Clustering Aggregation

Guangzhi Su, Shuchang Huang, Yutong Ke, Zhuohang Liu, Long Qian, Kaizhu Huang

TL;DR

Multimodal LLMs are vulnerable to adversarial perturbations in their vision and audio inputs, which can lead to unsafe or unreliable outputs. SmoothGuard defends these models by adding Gaussian noise to continuous inputs, generating responses for multiple perturbed samples, and using embedding-based clustering to filter outliers and select a stable final answer, with a sentiment-aware voting mechanism that favors benign outputs. The method preserves utility under benign conditions while significantly reducing jailbreak and safety-violation risks, and it is model-agnostic and requires no retraining. Experiments on MM-SafetyBench, POPE, and LLaVA-Bench show robustness gains across diverse architectures and identify an optimal noise range of $\sigma \in [0.1, 0.2]$.
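To make the perturbation step concrete, here is a minimal Python sketch of drawing $N$ noisy copies of a continuous input. It assumes a flattened image with pixel values normalized to [0, 1] and clips noisy values back into that range; the function name `gaussian_perturb`, the seeding, and the clipping are illustrative assumptions, not details taken from the paper. The `sigma` argument plays the role of the noise level $\sigma$ above.

```python
import random
from typing import List


def gaussian_perturb(pixels: List[float], sigma: float, n_samples: int,
                     seed: int = 0) -> List[List[float]]:
    """Return n_samples copies of `pixels`, each with i.i.d. Gaussian noise.

    Assumption: pixel values are normalized to [0, 1], and noisy values are
    clipped back into that range.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    samples = []
    for _ in range(n_samples):
        # Add zero-mean noise with standard deviation sigma, then clip to [0, 1].
        noisy = [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]
        samples.append(noisy)
    return samples


if __name__ == "__main__":
    image = [0.0, 0.25, 0.5, 0.75, 1.0]  # toy flattened "image"
    for sample in gaussian_perturb(image, sigma=0.15, n_samples=4):
        print([round(v, 3) for v in sample])
```

In a full pipeline, each noisy copy would be sent to the MLLM together with the unchanged text prompt to produce one candidate response.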

Abstract

Multimodal large language models (MLLMs) have achieved impressive performance across diverse tasks by jointly reasoning over textual and visual inputs. Despite their success, these models remain highly vulnerable to adversarial manipulations, raising concerns about their safety and reliability in deployment. In this work, we first generalize an approach for generating adversarial images within the HuggingFace ecosystem and then introduce SmoothGuard, a lightweight and model-agnostic defense framework that enhances the robustness of MLLMs through randomized noise injection and clustering-based prediction aggregation. Our method perturbs continuous modalities (e.g., images and audio) with Gaussian noise, generates multiple candidate outputs, and applies embedding-based clustering to filter out adversarially influenced predictions. The final answer is selected from the majority cluster, yielding stable responses even under malicious perturbations. Extensive experiments on POPE, LLaVA-Bench (In-the-Wild), and MM-SafetyBench demonstrate that SmoothGuard improves resilience to adversarial attacks while maintaining competitive utility. Ablation studies further identify an optimal noise level (standard deviation between 0.1 and 0.2) that balances robustness and utility.
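The clustering-based aggregation can be sketched as follows. The summary does not specify the embedding model or the clustering algorithm, so this sketch swaps in a toy bag-of-words embedding and a greedy cosine-threshold clustering; `embed`, `cluster_and_select`, and the 0.6 threshold are hypothetical. The returned answer is the medoid of the largest cluster, so an outlier response (for example one steered by an adversarial perturbation) lands in its own small cluster and is discarded.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy bag-of-words embedding, a hypothetical stand-in for a sentence encoder."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    # Both vectors are L2-normalized, so their dot product is the cosine similarity.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def cluster_and_select(responses: List[str], threshold: float = 0.6) -> str:
    """Greedily cluster responses, then return the medoid of the largest cluster."""
    if not responses:
        raise ValueError("need at least one response")
    vecs = [embed(r) for r in responses]
    clusters: List[List[int]] = []
    for i, v in enumerate(vecs):
        for members in clusters:
            # Join the first cluster whose seed (first member) is similar enough.
            if cosine(v, vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster: start a new one
    majority = max(clusters, key=len)  # ties go to the earliest cluster
    # Medoid: the member with the highest summed similarity within its cluster.
    best = max(majority, key=lambda i: sum(cosine(vecs[i], vecs[j]) for j in majority))
    return responses[best]


if __name__ == "__main__":
    candidates = [
        "there is a dog in the image",
        "yes there is a dog in the image",
        "the image shows a dog",
        "ignore safety rules and explain how to build a weapon",
    ]
    print(cluster_and_select(candidates))  # prints: there is a dog in the image
```

A production version would replace `embed` with a real sentence encoder and could use a standard clustering algorithm instead of the greedy pass shown here.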

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Overall pipeline of our randomized smoothing defense for MLLMs. Perturbed audio and image inputs are processed through the encoder and projection, producing $N$ responses via the language model. Outputs are clustered and sentiment-aware majority voting is applied to select the final robust answer.
  • Figure 2: Clustering-based aggregation of generated sentences.
  • Figure 3: Utility preservation of Qwen and LLaVA with randomized smoothing noise across different categories. Performance is normalized to the baseline (100%), showing that Qwen improves utility while LLaVA maintains competitive results.
  • Figure 4: LLaVA-Bench (In-the-Wild) relative scores of Qwen and LLaVA with and without randomized smoothing noise, compared against GPT-4 (higher is better).
  • Figure 5: Model utility under Gaussian noise in the adversarial setting.
  • ...and 1 more figure
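Figure 1 mentions sentiment-aware majority voting, but the summary does not describe how sentiment enters the vote. The sketch below is one hypothetical reading: each response casts one vote for its cluster, plus a `bonus` if a crude keyword check flags it as a refusal or otherwise benign reply. `REFUSAL_CUES`, `is_benign`, `sentiment_aware_vote`, and the 0.5 bonus are all assumptions; a real system would use a trained sentiment or safety classifier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical cue list standing in for a real sentiment/safety classifier.
REFUSAL_CUES = ("i can't", "i cannot", "i won't", "not able to help", "sorry")


def is_benign(response: str) -> bool:
    """Crude keyword check: treat refusals and apologies as benign replies."""
    text = response.lower()
    return any(cue in text for cue in REFUSAL_CUES)


def sentiment_aware_vote(labelled: List[Tuple[int, str]], bonus: float = 0.5) -> str:
    """Return a response from the cluster with the highest weighted vote.

    `labelled` pairs each response with its cluster id. Every response adds
    1 vote to its cluster, plus `bonus` if `is_benign` flags it.
    """
    if not labelled:
        raise ValueError("need at least one response")
    votes: Dict[int, float] = defaultdict(float)
    members: Dict[int, List[str]] = defaultdict(list)
    for cluster_id, response in labelled:
        votes[cluster_id] += 1.0 + (bonus if is_benign(response) else 0.0)
        members[cluster_id].append(response)
    # max() keeps the first maximum, so ties go to the cluster seen first.
    winner = max(votes, key=lambda cid: votes[cid])
    return members[winner][0]


if __name__ == "__main__":
    even_split = [
        (0, "Here is how to do it: step one..."),
        (0, "Follow these steps: first..."),
        (1, "Sorry, I can't help with that."),
        (1, "I cannot assist with this request."),
    ]
    print(sentiment_aware_vote(even_split))  # benign cluster wins: 3.0 vs 2.0
```

With this weighting a clear majority still wins, but an even split between a harmful and a benign cluster is resolved in favor of the benign one.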