Conceptual Contrastive Edits in Textual and Vision-Language Retrieval

Maria Lymperaiou, Giorgos Stamou

TL;DR

The paper addresses the interpretability of textual and vision-language (VL) retrieval models through post-hoc conceptual contrastive edits in a model-agnostic setting. It formulates word substitutions as a minimum-weight matching on a bipartite graph $(S,T,E)$ with edge weights $w_{s\rightarrow t}$, subject to $\sum_{t} x_{s\rightarrow t}=1$ (every source concept receives exactly one substitute) and $\sum_{s} x_{s\rightarrow t} \le 1$ (every target concept is used at most once), solved by the Hungarian algorithm in $O(|S||T|\log|S|)$ time. It introduces the ACE metric, defined as $ACE=\frac{\mathbb{E}[|o - o^*| / o]}{n} \times scale$, to quantify the per-word influence of an intervention on ranking outcomes. Experiments on LR and VL retrieval using Flickr reveal POS-specific effects, invariance patterns, and cross-modal differences, highlighting model biases and the value of controllable, explainable interventions for unimodal and VL retrieval.
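The matching formulation in the TL;DR can be illustrated with a small sketch. The code below is not the Hungarian algorithm and is not taken from the paper: it is an exhaustive search that is only practical for tiny inputs, included to make the objective and the two constraints concrete. The function name `min_weight_matching`, the example words, and the weight values are all hypothetical.

```python
from itertools import permutations


def min_weight_matching(weights):
    """Return (assignment, total) for the cheapest source-to-target matching.

    weights[s][t] is an assumed substitution cost w_{s->t}.
    assignment[s] is the target index chosen for source s.
    Brute force: enumerates every way to give each source a distinct target.
    """
    n_sources = len(weights)
    n_targets = len(weights[0]) if weights else 0
    if n_sources > n_targets:
        raise ValueError("need at least as many targets as sources")

    best_total = float("inf")
    best_assignment = ()
    # Each permutation picks one target per source (sum_t x_{s->t} = 1)
    # and never reuses a target (sum_s x_{s->t} <= 1).
    for targets in permutations(range(n_targets), n_sources):
        total = sum(weights[s][t] for s, t in enumerate(targets))
        if total < best_total:
            best_total, best_assignment = total, targets
    return list(best_assignment), best_total


# Hypothetical example: two source concepts, three candidate substitutes.
sources = ["dog", "ball"]
targets = ["cat", "wolf", "toy"]
weights = [
    [0.3, 0.2, 0.9],  # costs of replacing "dog"
    [0.8, 0.7, 0.1],  # costs of replacing "ball"
]

assignment, total = min_weight_matching(weights)
edits = {sources[s]: targets[t] for s, t in enumerate(assignment)}
```

Here `edits` maps each source word to its selected substitute. For realistic vocabulary sizes, a polynomial-time solver such as the Hungarian algorithm named in the summary would replace the enumeration.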

Abstract

As deep learning models grow in complexity, achieving model-agnostic interpretability becomes increasingly vital. In this work, we employ post-hoc conceptual contrastive edits to expose noteworthy patterns and biases imprinted in representations of retrieval models. We systematically design optimal and controllable contrastive interventions targeting various parts of speech, and effectively apply them to explain both linguistic and visiolinguistic pre-trained models in a black-box manner. Additionally, we introduce a novel metric to assess the per-word impact of contrastive interventions on model outcomes, providing a comprehensive evaluation of each intervention's effectiveness.

Paper Structure

This paper contains 24 sections, 2 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The pipeline of our contrastive edits method for retrieval models. Red is for the contrastively edited stream, blue is for the default stream.
  • Figure 2: Results ($ACE_{R@1}$ metric) for text-image retrieval (TIR) on the left and image-text retrieval (ITR) on the right, on the Flickr dataset, for all interventions.
  • Figure 3: Distributions of POS per caption and overall in the Flickr dataset.