Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
TL;DR
MAC benchmarks how easily pre-trained multimodal representations such as CLIP can be deceived by LLM-generated text updates, scoring those updates by cross-modal attack success and lexical diversity. The authors introduce a self-training pipeline that combines rejection-sampling fine-tuning with diversity-promoting filtering; it improves attack success rates across image, video, and audio representations and transfers across target models. The work highlights modality-agnostic vulnerabilities, proposes rigorous evaluation criteria, and shows that smaller LLMs with Best-of-N sampling plus diversity-aware selection can outperform larger, opaque models at exposing these weaknesses. The findings bear on the robustness of multimodal systems and motivate future work that extends deception analysis to longer captions and additional modalities, with ethical safeguards.
Abstract
While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities that lead to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples that exploit these vulnerabilities across different modalities, and that evaluates the samples through both sample-wise attack success rate and group-wise entropy-based diversity. To improve on zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including image, video, and audio.
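The two evaluation axes named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: an attack is counted as successful when the target model scores the deceptive text strictly higher than the original caption, and diversity is approximated by the Shannon entropy of whitespace-token unigrams over the whole group of generated texts. The function names `attack_success_rate` and `token_entropy` are hypothetical, and the benchmark's actual criteria may include further conditions.

```python
import math
from collections import Counter


def attack_success_rate(orig_scores, adv_scores):
    """Sample-wise metric: fraction of samples whose deceptive text
    receives a strictly higher similarity score than the original caption.
    (Assumed success criterion, for illustration only.)"""
    if len(orig_scores) != len(adv_scores):
        raise ValueError("score lists must have the same length")
    if not orig_scores:
        return 0.0
    wins = sum(1 for o, a in zip(orig_scores, adv_scores) if a > o)
    return wins / len(orig_scores)


def token_entropy(texts):
    """Group-wise metric: Shannon entropy (in bits) of the unigram
    distribution over all whitespace tokens in a group of texts.
    (Assumed simplification of an entropy-based diversity score.)"""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical similarity scores from a target model such as CLIP.
original = [0.30, 0.50, 0.20, 0.40]
deceptive = [0.40, 0.40, 0.30, 0.50]
rate = attack_success_rate(original, deceptive)
diversity = token_entropy(["a dog chases a cat", "a cat chases a dog"])
```

Reporting both numbers together matters because a generator could reach a high success rate by repeating one effective edit pattern; a group-level entropy score penalizes that kind of collapse.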
