Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security

Wei Zhao, Zhe Li, Yige Li, Jun Sun

TL;DR

Q-MLLM addresses two safety vulnerabilities of multimodal large language models: the susceptibility of continuous visual representations to gradient-based attacks, and the gap in transferring text-based safety to vision. It introduces two-level vector quantization at the vision encoder to produce discrete visual tokens, forming a non-differentiable bottleneck that disrupts adversarial optimization, complemented by an enhanced semantic safety signal via a quantized CLS token. Training proceeds in two stages (pretraining the codebooks and projection while the encoders stay frozen, followed by LLM-focused fine-tuning) to preserve safety guarantees while maintaining multimodal utility. Empirical results show near-perfect defense against jailbreak attacks (average defense success rate, DSR, up to 98.4%) and strong protection against toxic-image attacks (average DSR up to 75.9%), with minimal impact on vision-language benchmarks and modest inference overhead, demonstrating that discretization can enable robust, scalable safety for multimodal AI systems. The work offers a practical defense that reduces reliance on expensive safety-tuning or detection pipelines and points to broader opportunities for discrete representations in secure AI systems.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in cross-modal understanding, but remain vulnerable to adversarial attacks through visual inputs despite robust textual safety mechanisms. These vulnerabilities arise from two core weaknesses: the continuous nature of visual representations, which allows for gradient-based attacks, and the inadequate transfer of text-based safety mechanisms to visual content. We introduce Q-MLLM, a novel architecture that integrates two-level vector quantization to create a discrete bottleneck against adversarial attacks while preserving multimodal reasoning capabilities. By discretizing visual representations at both pixel-patch and semantic levels, Q-MLLM blocks attack pathways and bridges the cross-modal safety alignment gap. Our two-stage training methodology ensures robust learning while maintaining model utility. Experiments demonstrate that Q-MLLM achieves a significantly better defense success rate against both jailbreak attacks and toxic image attacks than existing approaches. Notably, Q-MLLM achieves a perfect defense success rate (100%) against jailbreak attacks except in one arguable case, while maintaining competitive performance on multiple utility benchmarks with minimal inference overhead. This work establishes vector quantization as an effective defense mechanism for secure multimodal AI systems without requiring expensive safety-specific fine-tuning or detection overhead. Code is available at https://github.com/Amadeuszhao/QMLLM.
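
To make the idea of a discrete bottleneck concrete, the following is a minimal sketch of two-level quantization, not the authors' implementation. It assumes a nearest-neighbour lookup under squared Euclidean distance (the standard vector-quantization rule; the abstract does not state the distance used), and the names `nearest_code`, `quantize_image`, and the toy codebooks are hypothetical. It shows that a small perturbation of the continuous features can leave the discrete token ids unchanged.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2).

    Assumption: nearest-neighbour assignment, as in standard vector quantization.
    """
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize_image(cls_vec, patch_vecs, semantic_codebook, patch_codebook):
    """Map one CLS feature and a list of patch features to discrete ids.

    Two levels: a semantic codebook for the CLS vector and a separate
    patch codebook for the patch vectors (hypothetical toy setup).
    """
    semantic_id = nearest_code(cls_vec, semantic_codebook)
    patch_ids = [nearest_code(p, patch_codebook) for p in patch_vecs]
    return semantic_id, patch_ids


# Toy codebooks in 2-D; real codebooks would be learned during training.
semantic_codebook = [[0.0, 0.0], [1.0, 1.0]]
patch_codebook = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

# Continuous features, as a vision encoder might produce them.
cls_vec = [0.9, 0.8]
patch_vecs = [[0.1, 0.9], [0.8, 0.1]]
clean = quantize_image(cls_vec, patch_vecs, semantic_codebook, patch_codebook)

# Add a small perturbation to every feature dimension.
eps = 0.05
cls_adv = [v + eps for v in cls_vec]
patches_adv = [[v + eps for v in p] for p in patch_vecs]
perturbed = quantize_image(cls_adv, patches_adv, semantic_codebook, patch_codebook)

print(clean)      # (1, [0, 1])
print(perturbed)  # (1, [0, 1])
```

In this toy case the ids are identical before and after the perturbation, so anything downstream that sees only the ids receives the same input. The index selection is also piecewise constant in the features, which illustrates why such a lookup provides no useful gradient for an attacker to follow.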

Paper Structure

This paper contains 23 sections, 23 equations, 18 figures, 9 tables, and 2 algorithms.

Figures (18)

  • Figure 1: Threat model for Multimodal Large Language Models (MLLMs), demonstrating two types of attacks: (1) jailbreak attacks combining adversarially perturbed images $X_{\text{img}}^{\text{adv}}$ with harmful text $X_t^{\text{harm}}$, and (2) image-based attacks using harmful images $X_{\text{img}}^{\text{harm}}$ with benign prompts $X_t^{\text{benign}}$. Defense success rates across different MLLMs reveal significant vulnerabilities in handling visual and multimodal threats.
  • Figure 2: Overview of Q-MLLM architecture and training methodology. Left: Q-MLLM employs hierarchical vector quantization on vision encoder representations through semantic and patch-level codebooks, generating discrete tokens for enhanced multimodal integration robustness. Right: The training pipeline comprises two distinct phases—Stage 1 involves codebook and projector pretraining with multi-objective loss functions while maintaining frozen vision encoder and LLM parameters; Stage 2 performs LLM fine-tuning through generative loss optimization.
  • Figure 4: Confusion Matrix of Classification Results. The diagonal values represent class-specific accuracy, showing the percentage of correctly identified instances for each category. Higher diagonal percentages indicate better model performance for that particular class. For instance, the classification accuracy for porn reaches 92.3%.
  • Figure 6: OpenAI Safety Judge Template
  • Figure : (a) Alcohol
  • ...and 13 more figures