3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek, Leonid Karlinsky, Raja Giryes
TL;DR
This work tackles the limited compositional understanding and interpretability of Vision-Language Models by introducing 3VL, a tree-augmented framework that expands captions into hierarchical trees and combines a tree-based training loss with a standard contrastive objective. It pairs this with the Anchor inference method and the Differential Relevance (DiRe) interpretability tool, which use HilaCAM relevance maps for targeted token removal and for comparative heatmaps that reveal the model's reasoning and failure modes. Empirically, 3VL achieves strong performance on compositional benchmarks such as VL-Checklist, while also delivering richer, more actionable explanations of its decisions through qualitative visualizations and quantitative interpretability assessments. Overall, the approach demonstrates that explainability-by-design, grounded in hierarchical linguistic structure, can improve both compositional reasoning and transparency in VLMs, with practical implications for debugging and bias mitigation.
Abstract
Vision-Language models (VLMs) have proven effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. However, these representations have key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. Moreover, VLMs typically have poor interpretability, making it challenging to debug and mitigate compositional-understanding failures. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool. By expanding the text of an arbitrary image-text pair into a hierarchical tree structure using language analysis tools, 3VL allows the induction of this structure into the visual representation learned by the model, enhancing its interpretability and compositional reasoning. Additionally, we show how Anchor, a simple technique for text unification, can be used to filter nuisance factors while increasing CLC understanding performance, e.g., on the fundamental VL-Checklist benchmark. We also show how DiRe, which performs a differential comparison between VLM relevancy maps, enables us to generate compelling visualizations of the reasons for a model's success or failure. Our code is available at: https://github.com/niryellinek/3VL.
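
To make the idea of a differential comparison between relevancy maps concrete, the following is a minimal sketch and not the released implementation. It assumes two relevance maps over the same image grid, one computed for a correct caption and one for a contrasting caption, each given as a nested list of floats. The function names (normalize, differential_relevance), the min-max scaling, and the clamping of negative differences to zero are all illustrative assumptions, not details taken from the abstract.

```python
from typing import List

# Hypothetical representation: a relevance map is a 2D grid of floats.
Map = List[List[float]]


def normalize(m: Map) -> Map:
    """Min-max scale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / span for v in row] for row in m]


def differential_relevance(pos: Map, neg: Map) -> Map:
    """Keep the relevance that the positive caption has beyond the negative one.

    Assumption: both maps are scaled independently before subtraction, and
    negative differences are clamped to zero.
    """
    p, n = normalize(pos), normalize(neg)
    return [
        [max(pv - nv, 0.0) for pv, nv in zip(p_row, n_row)]
        for p_row, n_row in zip(p, n)
    ]


# Toy 2x2 maps: the positive caption highlights the top-right cell,
# the negative caption highlights the bottom-left cell.
pos_map = [[0.0, 1.0], [0.5, 0.0]]
neg_map = [[0.0, 0.0], [1.0, 0.0]]
diff_map = differential_relevance(pos_map, neg_map)
```

In this toy case only the top-right cell survives, because it is the only region where the positive caption is more relevant than the negative one. Regions that both captions attend to are suppressed, which is the intuition behind comparing maps rather than inspecting one in isolation.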
