ULTra: Unveiling Latent Token Interpretability in Transformer-Based Understanding and Segmentation

Hesam Hosseini, Ghazal Hosseini Mighan, Amirabbas Afzali, Sajjad Amini, Amir Houmansadr

TL;DR

ULTra is a framework for interpreting latent tokens in Transformer-based understanding and segmentation. It backpropagates a scalar function of a target latent token through the attention layers to produce token-specific explanation maps, defined as $\overline{S}_i^{(l)}= \mathbf{C}_{i}^{(1,l)} \cdot \cdots \cdot \mathbf{C}_{i}^{(l-1,l)}$ with $S_i^{(l)} = \overline{S}_i^{(l)}[i, 1:]$. This enables unsupervised semantic segmentation using pre-trained ViTs without fine-tuning. A lightweight learnable transformation $\mathbf{W}$, trained with a dedicated self-consistency loss $\mathcal{L}_{\text{sc}}$, further improves performance. The approach is validated across vision and language tasks, achieving state-of-the-art results on multiple segmentation benchmarks and demonstrating interpretability in LLM text summarization through token-contribution analysis and a Comprehensiveness metric. While the framework offers broad zero-shot applicability and architectural faithfulness, the authors note the computational cost of gradient-based explanations and point to efficiency and scalability as future work.
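The definition of the explanation map above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the contribution maps $\mathbf{C}_{i}^{(b,l)}$ have already been computed (random matrices stand in for them here), reads the product as matrix multiplication in the written order, and assumes zero-based indexing in which column 0 belongs to a leading token such as CLS. The function name `explanation_map` and all sizes are hypothetical.

```python
import numpy as np

def explanation_map(contribution_maps, i):
    """Chain per-layer contribution maps, then keep row i without column 0.

    contribution_maps: list of square arrays, ordered b = 1, ..., l-1
    (assumed to be precomputed; how they are obtained is not shown here).
    """
    s_bar = contribution_maps[0]
    for c in contribution_maps[1:]:
        s_bar = s_bar @ c  # assumption: the product is a matrix product
    return s_bar[i, 1:]    # row of token i, leading-token column dropped

# Hypothetical sizes: 1 leading token + 4 patch tokens, target layer l = 4.
rng = np.random.default_rng(0)
num_tokens, target_layer = 5, 4
maps = [rng.random((num_tokens, num_tokens)) for _ in range(target_layer - 1)]

s_i = explanation_map(maps, i=2)
print(s_i.shape)  # one score per patch token
```

Each entry of the result is one score per non-leading input token for the chosen latent token `i`.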

Abstract

Transformers have revolutionized Computer Vision (CV) through self-attention mechanisms. However, their complexity makes latent token representations difficult to interpret. We introduce ULTra, a framework for interpreting Transformer embeddings and uncovering meaningful semantic patterns within them. ULTra enables unsupervised semantic segmentation using pre-trained models without requiring fine-tuning. Additionally, we propose a self-supervised training approach that refines segmentation performance by learning an external transformation matrix without modifying the underlying model. Our method achieves state-of-the-art performance in unsupervised semantic segmentation, outperforming existing segmentation methods. Furthermore, we validate ULTra for model interpretation on both synthetic and real-world scenarios, including Object Selection and interpretable text summarization using LLMs, demonstrating its broad applicability in explaining the semantic structure of latent token representations.

Paper Structure

This paper contains 30 sections, 25 equations, 23 figures, 7 tables.

Figures (23)

  • Figure 1: The overall architecture of the ULTra framework. The framework consists of a forward path, where the input data $\mathbf{x}$ is fed into the model, and a backward path, starting from the target layer $l$, where we compute the gradient of a scalar function of the $i$-th latent token, $f({\mathbf{z}_i^{(l)}})$, with respect to the attention probability matrix of the middle layer $b$. Next, we compute the corresponding contribution map $\mathbf{C}_{i}^{(b,l)}$ for all middle layers. Finally, we construct the explanation map $\overline{S}_{i}^{(l)}$, select its $i$-th row, and transform it to the input size. As an example, on the left, we observe that the token corresponding to the middle window assigns considerable attention to the left window, suggesting an underlying semantic understanding.
  • Figure 2: An example of token interpretation by our model and its predicted binary mask. (a) Original image. (b) Overlay of $\tilde{S}_i^{(13)}$ on the original image for different $i$, where the location of the $i$-th token is indicated by the purple square. (c) The binary mask $M_i^{(13)}$ for each corresponding explanation map in (b). We can generally observe that tokens clearly separate semantic entities, attending to the dog, the cat, the background, or even fine-grained attributes like the cat's or dog's head.
  • Figure 3: ULTra segmentation results on sample images. The top row displays the original images, the middle row shows true annotations, and the bottom row presents our model’s predictions.
  • Figure 4: Two examples illustrating the model’s decision-making across layers. Columns correspond to progressively deeper layers. The first row for each image shows the CLS token, while the subsequent rows show three selected tokens (highlighted by red squares). Deeper layers capture richer semantics: in Example 1, elephant, zebra, and background tokens become increasingly distinct as we go through layers. The CLS token represents both animals but does not differentiate background types (sky, ground, or water), whereas token-level representations do. In Example 2, which contains multiple animals, the CLS token captures only the tiger and part of the elephant, while other tokens represent additional objects. Interestingly, the model confuses the parrot with part of the rainbow due to similar colors.
  • Figure 6: Visualization of Token Contribution Scores ($\lambda_i^{(l)}$) highlighting the relevance of context tokens in interpreting the summary. Each token is colored proportionally to its $\lambda_i^{(l)}$ value. These visualizations demonstrate the model's ability to identify key semantic elements in the context for generating relevant summaries. Further analysis and examples are provided in Appendix \ref{sec:text_s}.
  • ...and 18 more figures
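
The caption of Figure 1 ends with selecting the $i$-th row of the explanation map and transforming it to the input size. The sketch below shows one simple way that last step could look. It is an assumption rather than the paper's procedure: it supposes the per-patch scores are stored in row-major patch order and uses nearest-neighbour upsampling via `np.kron`. The helper `to_input_size`, the grid shape, and the patch size are hypothetical.

```python
import numpy as np

def to_input_size(scores, grid_hw, patch_size):
    """Reshape per-patch scores to the patch grid and upsample to pixels.

    Assumptions: scores are in row-major patch order, and each patch
    covers a patch_size x patch_size block of the input image.
    """
    grid = np.asarray(scores, dtype=float).reshape(grid_hw)
    # Nearest-neighbour upsampling: repeat each score over its patch.
    return np.kron(grid, np.ones((patch_size, patch_size)))

# Hypothetical example: a 2x2 patch grid with 3x3-pixel patches.
scores = np.array([0.1, 0.9, 0.3, 0.7])
heatmap = to_input_size(scores, grid_hw=(2, 2), patch_size=3)
print(heatmap.shape)
```

A smoother interpolation (for example bilinear) could be substituted when the map is overlaid on the image, as in the overlays of Figure 2.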