VISTA: A Visual and Textual Attention Dataset for Interpreting Multimodal Models
Harshit, Tolga Tasdizen
TL;DR
This paper addresses the interpretability gap in Vision-Language Models by studying how model-derived attention aligns with human visual attention during image-text tasks. It introduces VISTA, a human-annotated dataset that pairs eye-tracking saliency with image descriptions, and uses KDE-derived attention maps to quantify correspondence. The authors evaluate multiple image-text alignment and open-vocabulary segmentation models with NCC and AUC, finding that CLIP-Seg and BLIP-ITM-Base show the strongest alignment with human attention, while some models like ViLT struggle. The dataset provides a principled benchmark and data-driven means to improve transparency and trustworthiness in multimodal systems.
Abstract
The recent developments in deep learning led to the integration of natural language processing (NLP) with computer vision, resulting in powerful integrated Vision and Language Models (VLMs). Despite their remarkable capabilities, these models are frequently regarded as black boxes within the machine learning research community. This raises a critical question: which parts of an image correspond to specific segments of text, and how can we decipher these associations? Understanding these connections is essential for enhancing model transparency, interpretability, and trustworthiness. To answer this question, we present an image-text aligned human visual attention dataset that maps specific associations between image regions and corresponding text segments. We then compare the internal heatmaps generated by VL models with this dataset, allowing us to analyze and better understand the model's decision-making process. This approach aims to enhance model transparency, interpretability, and trustworthiness by providing insights into how these models align visual and linguistic information. We conducted a comprehensive study on text-guided visual saliency detection in these VL models. This study aims to understand how different models prioritize and focus on specific visual elements in response to corresponding text segments, providing deeper insights into their internal mechanisms and improving our ability to interpret their outputs.
