How Well Can Vision Language Models See Image Details?

Chenhui Gou; Abdulwahab Felemban; Faizan Farooq Khan; Deyao Zhu; Jianfei Cai; Hamid Rezatofighi; Mohamed Elhoseiny

TL;DR

The paper probes whether vision-language models based on large language models can actually perceive image details beyond high-level semantics. It introduces Pixel Value Prediction (PVP) as a probing task and a three-stage Pixel Reconstruction pretraining pipeline (with optional ViT adaptation and LoRA-based fine-tuning) to train VLMs for pixel-level understanding. Experiments show that adapting the vision encoder significantly improves pixel-detail reconstruction and that the resulting model (PAE-LaMM) achieves substantial gains on downstream tasks like Referring Image Segmentation (+10.19 cIoU on average) and video-game playing (CarRacing +70.54, Space Invaders +80.34) while preserving general VLM capabilities. These findings suggest that explicit learning of image details can be effectively integrated into VLMs, enabling finer-grained visual reasoning in multimodal tasks.

Abstract

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore "How Well Can Vision Language Models See Image Details?" and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks requiring detailed image perception, such as referring image segmentation (with an average +10.19 cIoU improvement) and video game decision making (with average score improvements of +80.34 and +70.54 on two games, respectively).
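The pixel value prediction probe described above can be sketched as follows: ask the model for the value at each pixel location, assemble the answers into an image, and compare it with the ground truth. This is a minimal sketch under stated assumptions, not the authors' implementation. The prompt wording in `make_query`, the answer format handled by `parse_answer`, the stand-in `make_stub_model`, and the use of mean absolute error as the reconstruction error are all hypothetical; the text here does not specify them.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # image[y][x] = (r, g, b)


def make_query(x: int, y: int) -> str:
    # Hypothetical prompt wording; the paper's exact template is not given here.
    return f"What is the RGB value of the pixel at ({x}, {y})?"


def parse_answer(text: str) -> Pixel:
    # Assumes an answer such as "[12, 200, 7]"; each channel is clamped to 0..255.
    nums = [int(tok) for tok in text.strip("[]() ").split(",")]
    if len(nums) != 3:
        raise ValueError(f"expected three channel values, got: {text!r}")
    r, g, b = (max(0, min(255, n)) for n in nums)
    return (r, g, b)


def reconstruct(answer_fn: Callable[[str], str], width: int, height: int) -> Image:
    # Query every pixel location once and assemble the predictions into an image.
    return [
        [parse_answer(answer_fn(make_query(x, y))) for x in range(width)]
        for y in range(height)
    ]


def mean_abs_error(a: Image, b: Image) -> float:
    # Assumed error metric: mean absolute difference over all pixels and channels.
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pix_a, pix_b in zip(row_a, row_b):
            for ch_a, ch_b in zip(pix_a, pix_b):
                total += abs(ch_a - ch_b)
                count += 1
    return total / count


def make_stub_model(image: Image, offset: int) -> Callable[[str], str]:
    # Stand-in for a VLM: reads (x, y) back out of the query and returns the
    # true pixel value shifted by a fixed offset to simulate prediction error.
    def answer(query: str) -> str:
        inside = query[query.index("(") + 1 : query.index(")")]
        x, y = (int(v) for v in inside.split(","))
        r, g, b = image[y][x]
        return f"[{r + offset}, {g + offset}, {b + offset}]"

    return answer
```

In a real probe, `answer_fn` would wrap a call to the fine-tuned VLM with the image attached; the stub only exists so the sketch runs end to end.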

Paper Structure (12 sections, 1 equation, 5 figures, 8 tables)

This paper contains 12 sections, 1 equation, 5 figures, 8 tables.

Introduction
Related Work
Method
Method for investigating image perception ability of VLMs.
Pixel Reconstruction Pre-training for VLMs
Referring Image Segmentation
Video Games Playing
Experiments
Evaluation on pixel reconstruction
Results on Downstream tasks
Pixel Reconstruction Pre-training for VLM
Conclusion

Figures (5)

Figure 1: Method. a) shows our findings: Using the original CLIP vision features, VLMs can only reconstruct a blurry contour without many visual details. The reconstruction result can be improved by adapting the vision encoder. The reconstructed image is generated by querying pixel values with pixel locations, as shown in (b). For better illustration, the connection module between ViT and LLM is ignored. b) shows that we incorporate pixel prediction as a pretraining task for VLM. c) illustrates some downstream tasks performed by VLM, which require both vision detail understanding and language information. Our pretraining improves VLM performance on these tasks.
Figure 2: Examples of Game Playing by VLM. The input to the VLM is the stacked images and the game instructions. The first row shows an example of playing Carracing. The second row shows the SpaceInvaders game. The number of stacked frames depends on the expert model we used. For example, Carracing uses two frames and SpaceInvaders uses four.
Figure 3: Qualitative results of reconstruction. (a) and (d) are the ground-truth images. (b) and (e) are the images reconstructed by our method. (c) and (f) are the baseline results without CLIP-ViT adaptation. Compared with the baseline, our method reconstructs images with more detail. The average reconstruction errors of our method and the baseline on these 10 images are $6.67$ and $24.56$, respectively.
Figure 4: Qualitative results of Referring Image Segmentation. We first use the referring localization ability of the fine-tuned model to generate a bounding box (bbox) for the referring object, and then predict the segmentation mask inside the bbox.
Figure 5: Qualitative results of Carracing. We show the game observation from different models, including the expert Reinforcement Learning (RL) Model, Baseline Model, and Our Method, all playing under the same game seed. These images depict how each model behaves when controlling the car and approaching the same corner.
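The two-step referring segmentation flow in Figure 4 (predict a bounding box, then a mask inside it) and the cIoU score quoted in the abstract can be illustrated with a short sketch. It assumes the common convention that cIoU is the cumulative intersection over the cumulative union across all samples; this page does not define the metric, so treat that as an assumption rather than the authors' evaluation code. The helper names and the `(x0, y0, x1, y1)` box convention with exclusive upper bounds are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # mask[y][x] is 0 or 1


def paste_into_bbox(crop: Mask, bbox: Tuple[int, int, int, int],
                    height: int, width: int) -> Mask:
    # Place a mask predicted inside the box back onto a full-size canvas.
    # Assumed convention: bbox = (x0, y0, x1, y1), with x1 and y1 exclusive.
    x0, y0, x1, y1 = bbox
    full = [[0] * width for _ in range(height)]
    for dy in range(y1 - y0):
        for dx in range(x1 - x0):
            full[y0 + dy][x0 + dx] = crop[dy][dx]
    return full


def cumulative_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    # Sum intersections and unions over the whole set, then divide once.
    # Unlike a mean of per-sample IoUs, this weights large objects more.
    inter = 0
    union = 0
    for pred, gt in zip(preds, gts):
        for row_p, row_g in zip(pred, gt):
            for p, g in zip(row_p, row_g):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    return inter / union if union else 0.0
```

For example, one sample with a perfect 4-pixel match and another with 1 pixel of overlap out of a 2-pixel union give a cIoU of 5/6, whereas the mean of the per-sample IoUs would be 0.75.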
