Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan

TL;DR

This work tackles the vulnerability of multimodal large language models (MLLMs) to visual adversarial perturbations by replacing or augmenting the standard CLIP vision encoder with large-scale, adversarially pre-trained vision models and aligning them to CLIP within an end-to-end LLaVA-based framework. The authors introduce Robust-LLaVA and show that robust features learned through extensive adversarial pre-training, combined with enhanced semantic alignment, improve robustness on captioning and visual question-answering tasks, including resilience to jailbreaking and common image corruptions, while maintaining favorable clean performance. Key findings include state-of-the-art adversarial robustness under untargeted and targeted attacks, strong defense against jailbreak attacks, and the insight that an ensemble's robustness is limited by its weakest component, which favors a single highly robust encoder. The work offers a practical pathway toward safer MLLMs for real-world applications, with extensive analyses of alignment, robustness benchmarks, and potential defense strategies such as prompt-time techniques.

Abstract

Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jailbreak attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.

Paper Structure

This paper contains 24 sections, 2 equations, 17 figures, 22 tables, 1 algorithm.

Figures (17)

  • Figure 1: Robust performance of the proposed Robust-LLaVA on vision-language tasks at perturbation budget $\epsilon = 4/255$: The original CLIP exhibits minimal robustness. Our proposed Robust-LLaVA$^{4}$ outperforms state-of-the-art FARE$^{4}$ (Schlarmann et al., 2024) and Sim-CLIP$^{4}$ (Hossain et al., 2024) in robustness score across all tasks.
  • Figure 2: Illustration of untargeted $\ell_\infty$-attacks with $\epsilon = 4/255$ on LLaVA using different robust vision encoders: Both FARE$^{4}$ (Schlarmann et al., 2024) and Sim-CLIP$^{4}$ (Hossain et al., 2024) are vulnerable to adversarial attacks, while Robust-LLaVA$^{4}$ not only demonstrates robustness against these attacks but also maintains high performance on the original images.
  • Figure 3: Robust Accuracy of Models Across Different Datasets. The plot shows the robust accuracy of different models evaluated across various datasets. The PGD-10 attack is crafted at $\epsilon = 1/255$ with an image-text adversarial loss.
  • Figure 4: Robust Accuracy of Models Across Different Datasets. The plot illustrates the robust accuracy of various models evaluated on multiple datasets. The PGD-10 attack is generated at $\epsilon = 1/255$ using an image-text adversarial loss.
  • Figure 5: Clean and Robust Accuracy of Models on Ensemble-based transfer attack.
  • ...and 12 more figures
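
The captions above refer to untargeted $\ell_\infty$ attacks at $\epsilon = 4/255$ and to a 10-step PGD attack at $\epsilon = 1/255$. As a rough illustration of what such an attack does, here is a minimal NumPy sketch of $\ell_\infty$ PGD. It is not the paper's implementation: the step size, the random start, the function name `pgd_linf`, and the toy linear loss are all assumptions made for illustration, and the image-text adversarial loss used in the paper is not reproduced.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=4/255, step=1/255, n_steps=10, seed=0):
    """Untargeted L-infinity PGD: ascend the loss, then project into the eps-ball.

    x       : clean input with values in [0, 1]
    grad_fn : hypothetical callable returning d(loss)/d(input) at a given input
    """
    rng = np.random.default_rng(seed)
    # Random start inside the eps-ball (an assumption), clipped to valid pixels.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(n_steps):
        g = grad_fn(x_adv)                        # gradient of the loss w.r.t. the input
        x_adv = x_adv + step * np.sign(g)         # signed-gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the L-inf ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv

# Toy stand-in for a model loss (assumption): loss(x) = w . x, so the gradient is w.
w = np.array([0.5, -1.0, 2.0, -0.25])
x = np.full(4, 0.5)
x_adv = pgd_linf(x, grad_fn=lambda z: w)

print("max perturbation:", np.max(np.abs(x_adv - x)))
print("loss before/after:", w @ x, w @ x_adv)
```

In practice `grad_fn` would be the gradient of the attacked model's loss obtained by automatic differentiation. The projection step is what keeps the perturbation within the stated budget, which is why the captions report robustness at a specific $\epsilon$.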