Approximate Attributions for Off-the-Shelf Siamese Transformers

Lucas Möller, Dmitry Nikolaev, Sebastian Padó

TL;DR

This work addresses the interpretability gap for off-the-shelf Siamese transformers by (i) proposing a model with exact attribution ability that uses cosine similarity and preserves predictive performance, and (ii) introducing approximate attributions that can be computed without modifying the model. The method extends integrated Jacobians to Siamese setups and shows how to compute token-level attributions that sum to the model prediction under a neutral reference, with extensions to cosine similarity and approximate references. Empirically, the tuned model nearly matches the exact-attribution model in predictive accuracy, and approximate attributions prove reliable for deep representations while showing limitations for shallower layers. Analyses reveal that Siamese encoders (SEs) attend to core syntactic roles, largely ignore negation, and exhibit lexical bias. The findings guide practical use of attribution methods for SEs and motivate future work to train exact-attribution-capable Siamese models from scratch.

Abstract

Siamese encoders such as sentence transformers are among the least understood deep models. Established attribution methods cannot tackle this model class since it compares two inputs rather than processing a single one. To address this gap, we have recently proposed an attribution method specifically for Siamese encoders (Möller et al., 2023). However, it requires models to be adjusted and fine-tuned and therefore cannot be directly applied to off-the-shelf models. In this work, we reassess these restrictions and propose (i) a model with exact attribution ability that retains the original model's predictive performance and (ii) a way to compute approximate attributions for off-the-shelf models. We extensively compare approximate and exact attributions and use them to analyze the models' attendance to different linguistic aspects. We gain insights into which syntactic roles Siamese transformers attend to, confirm that they mostly ignore negation, explore how they judge semantically opposite adjectives, and find that they exhibit lexical bias.
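
The idea of pairwise attributions that sum to the prediction can be illustrated with a small sketch. It is not the authors' implementation: it assumes a toy encoder (a `tanh` layer followed by sum pooling), a dot-product similarity instead of cosine, and an all-zero reference input, and every function name is hypothetical. Under these assumptions the token-pair attribution matrix $A$ should satisfy $\sum_{ij} A_{ij} = f(\mathbf{a},\mathbf{b}) - f(\mathbf{r}_a,\mathbf{b}) - f(\mathbf{a},\mathbf{r}_b) + f(\mathbf{r}_a,\mathbf{r}_b)$, up to numerical integration error.

```python
import numpy as np

def encode(x, W):
    """Toy encoder (assumption): tanh layer, then sum pooling over tokens."""
    # x: (tokens, d_in), W: (d_in, d_out) -> embedding of shape (d_out,)
    return np.tanh(x @ W).sum(axis=0)

def integrated_jacobian(x, r, W, steps=256):
    """Average the encoder Jacobian along the straight path from r to x."""
    alphas = (np.arange(steps) + 0.5) / steps        # midpoint rule
    J = np.zeros((W.shape[1], *x.shape))             # (d_out, tokens, d_in)
    for alpha in alphas:
        h = np.tanh((r + alpha * (x - r)) @ W)       # (tokens, d_out)
        # d e_k / d x_{t,i} = (1 - h_{t,k}^2) * W_{i,k}
        J += np.einsum("tk,ik->kti", 1.0 - h**2, W)
    return J / steps

def attribute(a, b, r_a, r_b, W, steps=256):
    """Token-pair attribution matrix of shape (tokens_a, tokens_b)."""
    J_a = integrated_jacobian(a, r_a, W, steps)
    J_b = integrated_jacobian(b, r_b, W, steps)
    # Per-token contribution to the embedding shift, shape (d_out, tokens)
    da = np.einsum("kti,ti->kt", J_a, a - r_a)
    db = np.einsum("kti,ti->kt", J_b, b - r_b)
    return da.T @ db

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))               # hypothetical weights
a, b = rng.normal(size=(5, 4)), rng.normal(size=(6, 4))
r_a, r_b = np.zeros_like(a), np.zeros_like(b)        # zero reference (assumption)

def f(x, y):
    """Dot-product similarity between the two encoded inputs."""
    return encode(x, W) @ encode(y, W)

A = attribute(a, b, r_a, r_b, W)
lhs = f(a, b) - f(r_a, b) - f(a, r_b) + f(r_a, r_b)
print(A.shape, np.isclose(A.sum(), lhs, atol=1e-4))
```

In this toy setting the zero reference is encoded to the zero vector, so the three reference terms vanish and the attributions sum to $f(\mathbf{a},\mathbf{b})$ itself. For an off-the-shelf model the reference terms are generally nonzero, which is why the attributions there are only approximate.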

Paper Structure (28 sections, 4 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Attributions for the same example in the Exact (top) and Tuned (bottom) models. Plots include individual terms from the LHSs of Equation \ref{eq:two_inputs}.
  • Figure 2: Contributions of reference similarities, $f(\mathbf{a},\mathbf{r}_b)$ and $f(\mathbf{b},\mathbf{r}_a)$ (left), and the reference term, $f(\mathbf{r}_a,\mathbf{r}_b)$ (right), to attributions.
  • Figure 3: Spearman correlation between attributions from the Tuned and Exact model for all STS test set pairs (y axis) plotted against the mean predicted similarity of both models (x axis).
  • Figure 4: Agreement between attributions by the Tuned and Exact model. We compute Spearman and Pearson correlations, as well as the intersections between the top-3 and top-10 attributions for different layers and similarity scores $s>0.5$.
  • Figure 5: The relationships (with LOWESS smoothing) between sums of positive and negative elements of attribution matrices computed on the STS test set using the Shelf and the Exact model (left pane of top and bottom row, respectively) and the distribution of sums of positive elements in these matrices (right pane).
  • ...and 7 more figures
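
The agreement measures named in the captions of Figures 3 and 4 (Spearman and Pearson correlation, top-k intersection) can be sketched for a single pair of attribution matrices as follows. This is an illustration only: the helper names are hypothetical, ties are ignored when ranking, and the top-k entries are chosen by largest value, which may differ from the authors' exact procedure.

```python
import numpy as np

def ranks(v):
    """Ordinal ranks; ties are broken by position (assumes continuous scores)."""
    return np.argsort(np.argsort(v))

def agreement(A, B, k=3):
    """Pearson, Spearman, and top-k overlap between two attribution matrices."""
    x, y = A.ravel(), B.ravel()
    pearson = np.corrcoef(x, y)[0, 1]
    # Spearman correlation is the Pearson correlation of the ranks
    spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
    # Flat indices of the k largest attributions in each matrix
    top_a = set(np.argsort(x)[-k:].tolist())
    top_b = set(np.argsort(y)[-k:].tolist())
    overlap = len(top_a & top_b) / k
    return pearson, spearman, overlap

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 6))          # stand-in for one model's attributions
B = 2.0 * A + 1.0                    # a perfectly correlated second matrix
pearson, spearman, overlap = agreement(A, B, k=3)
print(pearson, spearman, overlap)
```

Rank-based measures such as Spearman correlation and top-k overlap ignore differences in scale between two models' attributions, whereas Pearson correlation also requires a linear relationship.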