How and where does CLIP process negation?

Vincent Quantmeyer, Pablo Mosteiro, Albert Gatt

TL;DR

The paper investigates how CLIP processes negation in a multimodal setting by applying causal tracing and attention analyses to the text encoder on the VALSE existence task. It frames the task as a comparison of forward-pass similarities, $d = S_{c,i}-S_{f,i}$ (the caption-image similarity minus the foil-image similarity), and defines a causal tracing score $CTE(l,p)=d^*/d$ for layer $l$ and position $p$ to localise where negation is processed. The results show strong localisation in early and late layers, with information redistributed toward later positions around layer $l=4$. Negator-selective attention analyses identify a small subset of heads, primarily in layer 4, that preferentially attend to negators, with the second-to-last position often driving this effect, and show context-dependent variation between caption- and foil-based negation. The study also highlights dataset features (e.g., caption/foil similarity and subject size) that correlate with classification difficulty, suggesting that VALSE scores are confounded by dataset properties and do not reflect linguistic understanding alone. Overall, the work demonstrates how LM interpretability techniques can extend to multimodal models, provides concrete insights into CLIP's negation processing, and cautions against overinterpreting benchmark scores, given dataset limitations and the only partial locality of information processing.
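As a rough illustration of the two quantities named above, the following Python sketch computes the similarity gap $d$ from toy embeddings and the ratio $CTE = d^*/d$. Cosine similarity as the measure $S$, the function names, and the toy vectors are all assumptions made for illustration. How $d^*$ is obtained from a patched forward pass is not shown here.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def similarity_gap(image_emb, caption_emb, foil_emb):
    """d = S_{c,i} - S_{f,i}; a positive value means the caption is preferred."""
    return cosine(caption_emb, image_emb) - cosine(foil_emb, image_emb)


def causal_tracing_effect(d_star, d, eps=1e-8):
    """CTE = d* / d; returns NaN when d is too close to zero to divide by."""
    if abs(d) < eps:
        return float("nan")
    return d_star / d


# Toy 2-D embeddings (hypothetical, not taken from CLIP).
image = np.array([1.0, 0.0])
caption = np.array([1.0, 0.0])  # perfectly aligned with the image
foil = np.array([0.0, 1.0])     # orthogonal to the image

d = similarity_gap(image, caption, foil)
half_restored = causal_tracing_effect(0.5 * d, d)
```

In this toy case the caption wins by the largest possible margin for non-negative similarities, and a patched gap of half that size gives a CTE of 0.5, i.e. half of the original effect.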

Abstract

Various benchmarks have been proposed to test linguistic understanding in pre-trained vision & language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al., 2022) which we use to test models' understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they do not reveal anything about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we approach these questions through an in-depth analysis of the text encoder in CLIP (Radford et al., 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold. We demonstrate how methods from the language model interpretability literature (such as causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding.

Paper Structure (22 sections, 3 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Examples from VALSE existence (Parcalabescu et al., 2022). Caption and foil only differ in the presence or absence of the negator "no". The negator is either in the caption or the foil.
  • Figure 2: Illustration of the causal tracing methodology. The activation at a single position and layer from the negated forward pass is inserted into the corresponding layer and position of the non-negated forward pass. This shows what proportion of the original effect can be restored by this layer-position pair. Image and text are taken from VALSE existence (Parcalabescu et al., 2022).
  • Figure 3: Causal tracing effect (CTE) of the correct segment, split by whether negation is in foil or caption. The heatmaps show the CTE of each layer-position pair in the text encoder. The bar charts show the standard deviation of all CTE in the corresponding layer as an overall measure of localisation. Layer 0 denotes the embedding layer.
  • Figure 4: Negator-selective attention across all dataset segments, split by whether negation is in foil or caption. The heatmaps indicate the degree of negator-selective attention for each attention head and layer. The bar charts show the average of each layer as an overall measure of negator-selective attention.
  • Figure 5: Relative size of image subject vs. CLIP's classification score. All instances where the subject from the caption is shown in the image. Colour indicates dataset segment. The blue line shows classification accuracy when imposing a minimum subject size threshold.
  • ...and 3 more figures
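
The patching procedure described for Figure 2 and the per-layer standard deviation described for Figure 3 can be sketched on a toy model. The Python code below is a minimal sketch under stated assumptions, not the authors' implementation. The "encoder" is a hypothetical stack of residual layers with causal mean-mixing standing in for CLIP's text encoder, the last position serves as the text embedding, and the two toy sequences have equal length and differ at a single position. Aligning real captions and foils of different lengths is not covered. The ratio is computed as (patched similarity minus non-negated similarity) divided by (negated similarity minus non-negated similarity), which is one plausible reading of "proportion of the original effect restored".

```python
import numpy as np


def run_encoder(tokens, weights, patch=None):
    """Toy stand-in for a text encoder (hypothetical, not CLIP).

    tokens:  (positions, dim) embeddings; layer 0 is the embedding layer.
    weights: list of (dim, dim) matrices, one per layer.
    patch:   optional (layer, position, vector) that overwrites one activation.
    Returns the last-position vector and the activations of every layer.
    """
    h = tokens.copy()
    acts = []
    for layer in range(len(weights) + 1):
        if layer > 0:
            # Causal mixing: each position sees the running mean up to itself.
            context = np.cumsum(h, axis=0) / np.arange(1, len(h) + 1)[:, None]
            h = h + np.tanh(context @ weights[layer - 1])
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        acts.append(h.copy())
    return h[-1], acts


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def cte_grid(negated, plain, image, weights):
    """Share of the similarity change restored by each (layer, position) patch."""
    out_neg, acts_neg = run_encoder(negated, weights)
    out_plain, _ = run_encoder(plain, weights)
    base = cosine(out_plain, image)
    d = cosine(out_neg, image) - base
    grid = np.zeros((len(weights) + 1, len(plain)))
    for layer in range(grid.shape[0]):
        for pos in range(grid.shape[1]):
            patched, _ = run_encoder(
                plain, weights, patch=(layer, pos, acts_neg[layer][pos])
            )
            grid[layer, pos] = (cosine(patched, image) - base) / d
    return grid


rng = np.random.default_rng(0)
dim, positions, n_layers, negator_pos = 8, 5, 3, 2
weights = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(n_layers)]
plain = rng.normal(size=(positions, dim))
negated = plain.copy()
negated[negator_pos] = rng.normal(size=dim)  # only this position differs
image = rng.normal(size=dim)

grid = cte_grid(negated, plain, image, weights)
localisation = grid.std(axis=1)  # per-layer spread across positions
```

Two sanity checks follow from the construction. Patching the differing position at the embedding layer, or the last position at the final layer, reproduces the negated output exactly, so the ratio there is 1. Patching a position before the differing one changes nothing, so the ratio is 0. A high per-layer standard deviation means that a few positions carry most of the effect in that layer.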