Half-Truths Break Similarity-Based Retrieval

Bora Kargi, Arnas Uselis, Seong Joon Oh

TL;DR

We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference.

Abstract

When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
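
To make the half-truth test concrete, the following is a minimal sketch of how a half-truth accuracy could be computed from precomputed embeddings. It assumes cosine similarity as the image-text score and counts a triple as correct only when the anchor scores strictly higher than the half-truth; the function names, the tie-handling rule, and the toy vectors are illustrative assumptions, not taken from the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def half_truth_accuracy(image_embs, anchor_embs, half_truth_embs):
    """Fraction of (image, anchor, half-truth) triples where the anchor wins.

    Assumption: a tie counts as a failure (strict inequality).
    """
    correct = 0
    for img, anchor, half_truth in zip(image_embs, anchor_embs, half_truth_embs):
        if cosine(img, anchor) > cosine(img, half_truth):
            correct += 1
    return correct / len(image_embs)


# Toy 2-D embeddings (hypothetical values, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
anchors = [[1.0, 0.0], [1.0, 1.0]]      # short, correct descriptions
half_truths = [[1.0, 1.0], [0.0, 1.0]]  # anchor plus one incorrect detail

acc = half_truth_accuracy(images, anchors, half_truths)
print(acc)  # 0.5: the first triple is ranked correctly, the second is not
```

In practice the embeddings would come from the image and text encoders of the model under evaluation; under this pairwise setup, a model that ranks the two descriptions at random would score about 50%.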

Paper Structure

This paper contains 20 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The half-truth vulnerability. Starting from a short, caption-supported description (the anchor), we form a half-truth by adding one realistic but incorrect detail: either a wrong entity description or a wrong relation between entities. CLIP (Radford et al., 2021) and NegCLIP (Yuksekgonul et al., 2022) assign higher similarity to the half-truth, while CS-CLIP (ours) correctly penalizes the incorrect addition.
  • Figure 2: Half-truth construction. (i) Parse captions into units and generate foils via minimal edits. (ii) Sample an anchor $A$ and append one foil to form half-truth $A^{-}$. (iii) CLIP-style models can assign higher similarity to $A^{-}$ than $A$, motivating unit-level supervision.
  • Figure 3: Half-Truth Accuracy ($\text{Acc}_{\mathrm{HT}}$), entity vs. relation additions. Higher is better; dashed line indicates random chance. Sentence-level negatives improve the easier case, but relation additions remain difficult to reject.
  • Figure 4: CS-CLIP training pipeline. (i) Unit extraction & foil generation. Given a caption $T_j$, a text-only LLM extracts entity units and relation units, then generates a matched foil for each unit via a minimal, realistic edit (e.g., "brown horse" → "white horse"). (ii) Unit sampling & encoding. We sample one unit-foil pair $(U_j,\tilde{U}_j)$ per image and encode the image $I_j$ with $f_\phi$ and the unit texts with $g_\phi$. (iii) Unit-level supervision. The unit loss $\mathcal{L}_{\mathrm{unit}}$ pulls the image embedding $i_j$ toward the correct unit $u_j$ and pushes it away from the matched foil $\tilde{u}_j$ and other in-batch units. Sentence-level contrastive training is applied in parallel (not shown).
  • Figure 5: Compositional benchmark accuracy vs. Half-Truth Accuracy ($\mathrm{Acc}_{\mathrm{HT}}$). Each point represents a model. The x-axis shows average Image-to-Text accuracy across compositional benchmarks; the y-axis shows Half-Truth Accuracy $\mathrm{Acc}_{\mathrm{HT}}$. CS-CLIP improves both metrics.
  • ...and 2 more figures
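
The unit-level supervision described in the Figure 4 caption can be illustrated with a small sketch. The exact form of $\mathcal{L}_{\mathrm{unit}}$ is not given in this listing, so the code below assumes a softmax cross-entropy (InfoNCE-style) objective in which each image's candidates are all in-batch units plus its own matched foil; the temperature value, function names, and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def unit_loss(image_embs, unit_embs, foil_embs, temperature=0.07):
    """Assumed InfoNCE-style unit loss, averaged over the batch.

    For image j the positive is unit j; the negatives are the other
    in-batch units and the matched foil j.
    """
    images = [normalize(v) for v in image_embs]
    units = [normalize(v) for v in unit_embs]
    foils = [normalize(v) for v in foil_embs]

    total = 0.0
    for j, img in enumerate(images):
        # Candidate logits: every in-batch unit, then this image's own foil.
        logits = [dot(img, u) / temperature for u in units]
        logits.append(dot(img, foils[j]) / temperature)
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[j]  # negative log-probability of unit j
    return total / len(images)


# Toy batch of two images (hypothetical 2-D embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
units = [[1.0, 0.0], [0.0, 1.0]]  # correct units, aligned with their images
foils = [[0.6, 0.8], [0.8, 0.6]]  # minimally edited foils, partly misaligned

aligned = unit_loss(images, units, foils)
swapped = unit_loss(images, foils, units)  # foils treated as positives
print(aligned < swapped)  # True
```

Under these assumptions the loss is small when each image is closest to its correct unit and grows when the foil is closer, which matches the pull and push behaviour the caption describes. A full training setup would add the sentence-level contrastive term that the caption says runs in parallel.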