Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim

TL;DR

MAC benchmarks how pre-trained multimodal representations like CLIP can be deceived by text updates generated by LLMs, evaluated through crossmodal attack success and lexical diversity. The authors introduce a self-training pipeline with rejection-sampling fine-tuning and diversity-promoting filtering, achieving improved attack rates across image, video, and audio, and demonstrating cross-model transfer. The work highlights modality-agnostic vulnerabilities, proposes rigorous evaluation criteria, and shows that smaller LLMs with Best-of-N sampling plus diversity-aware selection can outperform larger, opaque models in exposing weaknesses. This has implications for robustness of multimodal systems and motivates future work on extending deception analysis to longer captions and additional modalities, with ethical safeguards.

Abstract

While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities that lead to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples that exploit these vulnerabilities across different modalities, and that evaluates the samples through both sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including image, video, and audio.
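The two evaluation axes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an attack counts as successful when the deceptive text has a higher cosine similarity to the media embedding than the original text does, and it assumes diversity is measured as Shannon entropy over whitespace-separated tokens pooled across a group of texts. The function names (`cosine_similarity`, `attack_success_rate`, `token_entropy`) and the toy embeddings are hypothetical, and the benchmark's actual criteria may be stricter.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attack_success_rate(media_embs, orig_text_embs, adv_text_embs):
    """Fraction of samples where the deceptive text outscores the original.

    Assumption: success means sim(media, adversarial) > sim(media, original).
    """
    if not media_embs:
        raise ValueError("need at least one sample")
    successes = 0
    for media, orig, adv in zip(media_embs, orig_text_embs, adv_text_embs):
        if cosine_similarity(media, adv) > cosine_similarity(media, orig):
            successes += 1
    return successes / len(media_embs)


def token_entropy(texts):
    """Shannon entropy (bits) of the token distribution over a group of texts.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy 2-d embeddings standing in for encoder outputs (hypothetical values).
media = [[1.0, 0.0], [0.0, 1.0]]
original = [[1.0, 1.0], [0.0, 1.0]]
adversarial = [[1.0, 0.0], [1.0, 0.0]]

rate = attack_success_rate(media, original, adversarial)
diversity = token_entropy(["a dog chases a cat", "a cat chases a dog"])
```

Reporting the two numbers together matters because a generator can score a high success rate by reusing one edit pattern, which a group-level entropy would reveal as low diversity.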

Paper Structure

This paper contains 33 sections, 8 equations, 11 figures, 19 tables, 1 algorithm.

Figures (11)

  • Figure 1: Key idea of Multimodal Adversarial Compositionality (MAC). MAC benchmarks compositional vulnerabilities of a pre-trained multimodal representation (e.g., CLIP, LanguageBind) with a comprehensive set of criteria. $\text{CLIP}(\cdot,\cdot)$ denotes the cosine similarity between image and text embeddings from CLIP.
  • Figure 2: Overview of (a) multimodal adversarial compositionality and (b) diversity-promoting self-training.
  • Figure 3: Analysis of our proposed framework. Please refer to Sec. \ref{subsec:performance_analysis} for a detailed explanation.
  • Figure 4: Influence of $N$ in self-training.
  • Figure 5: Qualitative examples from COCO, MSRVTT, and AudioCaps datasets (from top to bottom).
  • ...and 6 more figures