3VL: Using Trees to Improve Vision-Language Models' Interpretability

Nir Yellinek; Leonid Karlinsky; Raja Giryes

TL;DR

This work tackles the limited compositional understanding and interpretability of Vision-Language Models (VLMs) by introducing 3VL, a tree-augmented framework that expands captions into hierarchical trees and combines a tree-based training loss with a standard contrastive objective. It adds two interpretability tools, Anchor and Differential Relevance (DiRe), which build on HilaCAM relevancy maps to guide targeted token removal and to produce comparative heatmaps that reveal the model's reasoning and failure modes. Empirically, 3VL achieves strong performance on compositional benchmarks such as VL-Checklist, while also delivering more informative explanations of its decisions, supported by qualitative visualizations and quantitative interpretability assessments. Overall, the approach demonstrates that explainability-by-design, grounded in hierarchical linguistic structure, can improve both compositional reasoning and transparency in VLMs, with practical implications for debugging and bias mitigation.

Abstract

Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure. Our code is available at: https://github.com/niryellinek/3VL.
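
To make the caption-expansion idea concrete, the following is a minimal sketch of how a caption might be turned into tree levels with negatives. It is not the authors' implementation: it assumes the caption has already been split into chunks by a language analysis tool, and the `NEGATIVE_WORDS` table, `TreeLevel` class, and `build_caption_tree` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical replacement table standing in for a real source of negative words.
NEGATIVE_WORDS = {"white": "black", "dog": "cat", "green": "red", "grass": "sand"}


@dataclass
class TreeLevel:
    positive: str                                   # sub-caption that matches the image
    negatives: list = field(default_factory=list)   # sub-captions with one word swapped


def build_caption_tree(chunks):
    """Expand pre-parsed caption chunks into a list of tree levels.

    Level k holds the caption rebuilt from the first k+1 chunks, plus
    negatives obtained by swapping one word of the newest chunk.
    """
    levels = []
    for depth in range(1, len(chunks) + 1):
        prefix = " ".join(chunks[:depth - 1])
        words = chunks[depth - 1].split()
        positive = " ".join(chunks[:depth])
        negatives = []
        for i, word in enumerate(words):
            if word in NEGATIVE_WORDS:
                swapped = words[:i] + [NEGATIVE_WORDS[word]] + words[i + 1:]
                # Drop the empty prefix at the first level before joining.
                parts = [p for p in (prefix, " ".join(swapped)) if p]
                negatives.append(" ".join(parts))
        levels.append(TreeLevel(positive, negatives))
    return levels


# Assumed chunking of the caption "a white dog on green grass".
tree = build_caption_tree(["a white dog", "on green grass"])
for level in tree:
    print(level.positive, "|", level.negatives)
```

Each level extends the previous sub-caption by one chunk, so a model scored against these levels is asked progressively finer questions about the same image.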

Paper Structure (25 sections, 3 equations, 10 figures, 12 tables)

This paper contains 25 sections, 3 equations, 10 figures, 12 tables.

Introduction
Background
Interpretability of deep neural networks
Tree usage for deep learning
Compositionality in Vision-Language Models
The Tree-augmented Vision-Language (3VL) model
Caption tree generation
Tree-based training
Relevancy Maps based Token Removal and Interpretability
Token Removal
HilaCAM Anchor
Differential Relevance (DiRe)
Experiments
3VL Compositional Language Concepts Evaluation
3VL Visual Spatial Reasoning (VSR) Evaluation
...and 10 more sections

Figures (10)

Figure 1: The caption tree generation flow: (i) parse the sentence to get noun phrases and parts of speech, (ii) hierarchically reconstruct the caption, (iii) generate negatives for each sub-caption, and (iv) compose the final tree.
Figure 2: The tree loss and contrastive loss that are used for training 3VL. For the tree loss, we first generate a caption tree and then sum the cross-entropy loss over all tree levels. For the contrastive loss, we calculate the average cross-entropy loss over all image-text pairs in the batch.
Figure 3: Generating one relevancy heatmap using an "Anchor" text from two text possibilities. Note that unlike Figure 4, instead of having two heatmaps, here we have only one heatmap that is generated from the "Anchor" text that we create from the positive and negative texts.
Figure 4: When we have two possible texts for one image, we may apply HilaCAM two times to get two different relevancy heatmaps.
Figure 5: 3VL Token Removal accuracy on VL-Checklist (average of Attribute and Relation). HilaCAM vs. Anchor vs. DiRe for both 3VL and vanilla CLIP. 3VL achieves higher accuracy than vanilla CLIP, and its relative improvement with Anchor is larger. A larger improvement from Token Removal indicates a better understanding of token importance and therefore better interpretability.
...and 5 more figures
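
The two training losses described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function names, the convention that the positive sub-caption sits at index 0 of each level, the symmetric (image-to-text and text-to-image) form of the contrastive loss, and the unweighted sum of the two losses are all assumptions.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against index `target`."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def tree_loss(level_scores):
    """Sum the cross-entropy loss over all tree levels.

    Each entry holds image-text similarity scores for one level, with the
    positive sub-caption at index 0 followed by its negatives (assumed layout).
    """
    return sum(cross_entropy(scores, 0) for scores in level_scores)


def contrastive_loss(similarity):
    """Average cross-entropy over all image-text pairs in a batch.

    `similarity[i][j]` scores image i against text j, and matching pairs sit
    on the diagonal. Averaging both directions is an assumption here.
    """
    n = len(similarity)
    image_to_text = sum(cross_entropy(similarity[i], i) for i in range(n))
    columns = [[similarity[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n))
    return (image_to_text + text_to_image) / (2 * n)


# Made-up scores: two tree levels for one image, and a batch of two pairs.
level_scores = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]]
similarity = [[3.0, 0.2], [0.1, 2.5]]
total = tree_loss(level_scores) + contrastive_loss(similarity)  # assumed unweighted sum
print(total)
```

The tree term penalizes each level separately, so a mistake on a single attribute or relation contributes its own loss instead of being averaged into one caption-level score.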
