On the Brittleness of CLIP Text Encoders

Allie Tran, Luca Rossetto

TL;DR

CLIP-style vision-language systems exhibit instability and brittleness when subjected to small textual perturbations. The authors introduce instability and brittleness metrics and perform large-scale controlled evaluations across seven CLIP variants on 190 TRECVID AVS queries and the V3C1 collection, spanning lexical, syntactic, and semantic perturbations. They show that syntactic and semantic changes drive the largest ranking fluctuations, while trivial edits such as punctuation or case induce disproportionate brittleness; architectural and training refinements reduce but do not eliminate fragility. The findings underscore robustness as a core requirement for real-world retrieval and safety-critical applications, and propose mitigation strategies like brittleness-aware training and query-normalization layers. The released code and analysis framework enable broader research into cross-modal robustness and practical deployment considerations.

Abstract

Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models trained on a contrastive alignment can lack stability towards small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overlap@k between original and perturbed rankings for each perturbation type across models.
  • Figure 2: Instability distributions across models. Lower values indicate greater robustness. The vertical lines show the average mean and median instability for all models.
  • Figure 3: Instability vs. text distance (normalised by inter-query distance) for a sample of perturbed queries, with LOESS fits per model. Slopes differ across models, indicating varying sensitivity to embedding shifts.
  • Figure 4: Brittleness index across models and perturbation classes. Darker colours indicate higher brittleness. Lexical perturbations (Class 1) stand out as disproportionately brittle.
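
The Overlap@k measure named in Figure 1 can be illustrated with a minimal sketch. It assumes the common definition, in which the overlap is the size of the intersection of the two top-k result sets divided by k. The paper's exact formulation is not reproduced in this summary, and the function name and the sample video IDs below are hypothetical.

```python
def overlap_at_k(original, perturbed, k):
    """Fraction of the top-k items shared by two rankings.

    Assumption: overlap is |top_k(original) & top_k(perturbed)| / k,
    ignoring the order of items within the top k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_original = set(original[:k])
    top_perturbed = set(perturbed[:k])
    return len(top_original & top_perturbed) / k


# Hypothetical rankings returned for an original and a perturbed query.
original_ranking = ["v1", "v2", "v3", "v4", "v5"]
perturbed_ranking = ["v2", "v1", "v6", "v3", "v7"]

print(overlap_at_k(original_ranking, perturbed_ranking, 3))
print(overlap_at_k(original_ranking, perturbed_ranking, 5))
```

Under this definition, a value of 1 means the perturbation left the top-k set unchanged, and a value of 0 means the two top-k sets share no items. Because the measure ignores order inside the top k, a reshuffling of the same k results would not register as a change.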