What's in the Image? A Deep-Dive into the Vision of Vision Language Models

Omri Kaduri; Shai Bagon; Tali Dekel

TL;DR

A thorough empirical analysis of attention modules across layers reveals that VLMs store global image information in the query tokens: the model generates surprisingly descriptive responses solely from these tokens, without direct access to image tokens.

Abstract

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in comprehending complex visual content. However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on attention modules across layers. We reveal several key insights about how these models process visual data: (i) The internal representation of the query tokens (e.g., representations of "describe the image") is utilized by VLMs to store global image information; we demonstrate that these models generate surprisingly descriptive responses solely from these tokens, without direct access to image tokens. (ii) Cross-modal information flow is predominantly influenced by the middle layers (approximately 25% of all layers), while early and late layers contribute only marginally. (iii) Fine-grained visual attributes and object details are directly extracted from image tokens in a spatially localized manner, i.e., the generated tokens associated with a specific object or attribute attend strongly to their corresponding regions in the image. We propose a novel quantitative evaluation to validate our observations, leveraging real-world complex visual scenes. Finally, we demonstrate the potential of our findings in facilitating efficient visual processing in state-of-the-art VLMs.
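
To make insight (i) concrete, the following is a minimal sketch of how generation "without direct access to image tokens" could be expressed as an attention mask. It is an illustration under stated assumptions, not the authors' implementation: the token layout `[image | query | generated]`, the function name `knockout_mask`, and the token counts are all hypothetical.

```python
import numpy as np

def knockout_mask(n_image, n_query, n_generated):
    """Boolean attention mask (True = may attend) over an assumed
    sequence layout: [image tokens | query tokens | generated tokens]."""
    n = n_image + n_query + n_generated
    # Standard causal mask: each token attends to itself and earlier tokens.
    mask = np.tril(np.ones((n, n), dtype=bool))
    # Block generated tokens from attending to image tokens, so any image
    # information must reach them through the query-token representations.
    gen_start = n_image + n_query
    mask[gen_start:, :n_image] = False
    return mask

# Hypothetical sizes chosen only for illustration.
mask = knockout_mask(n_image=4, n_query=3, n_generated=2)
```

In this sketch the query tokens still attend to the image tokens, while the generated tokens can only see the query tokens and earlier generated tokens.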
