
Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida

Abstract

Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layouts. Although recent LVLMs have shown progress on VDU benchmarks, their performance is typically evaluated based on generated responses, which may not necessarily reflect whether the model has actually captured the required information internally. In this paper, we investigate how the information required to solve VDU tasks is represented across different layers of the LLMs within LVLMs, using linear probing. Our study reveals that (1) there is a clear gap between internal representations and generated responses, and (2) the information required to solve the task is often encoded more linearly at intermediate layers than at the final layer. Motivated by these findings, we explore fine-tuning strategies that target intermediate layers. Experiments show that fine-tuning intermediate layers improves both linear probing accuracy and response accuracy while narrowing the gap.

Paper Structure

This paper contains 38 sections, 5 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of our analysis. We analyze the gap between the internal representations and responses in VDU. For the analysis of internal representations, we employ linear probing and construct classifiers at each layer. For the response, we evaluate the accuracy of text responses.
  • Figure 2: Examples of linear probing tasks. We use four linear probing tasks covering different aspects of VDU. These include visual attribute recognition, which targets properties such as color and shape; word recognition, which focuses on identifying spelling differences between the word in the image and the query; structure understanding, which asks about the document component highlighted in the image; and figure understanding, which requires reasoning over graphical elements in charts.
  • Figure 3: Linear probing accuracy at each layer (line plot) and text-response accuracy (horizontal black dotted line). Token types used in linear probing are divided into four categories: image-token, text-token, all-token, and last-token. The vertical axis represents accuracy, and the horizontal axis corresponds to the LLM layers, plotted every two layers.