Efficient Architectures for High Resolution Vision-Language Models

Miguel Carvalho; Bruno Martins

TL;DR

Efficient high-resolution Vision-Language Models (VLMs) are needed to capture fine-grained details in images, including scene-text. The paper introduces Pheye, a parameter-efficient architecture that keeps a language model and a CLIP-based vision encoder frozen, adding dense cross-attention layers and two sets of LoRA adapters so that high-resolution images can be processed as global and local patches. In experiments, Pheye achieves competitive results on TextVQA and related tasks with far fewer trainable parameters. Higher image resolution yields stronger gains on fine-detail tasks, improving scene-text performance without OCR tokens being provided in the prompts. The work demonstrates a practical path to deploying high-resolution VLMs on resource-constrained hardware and suggests future directions such as alternative vision encoders, synthetic data for scene-text, and multilingual capabilities.

Abstract

Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.

Paper Structure (19 sections, 4 equations, 4 figures, 5 tables)

Introduction
An Efficient High-Resolution VLM
The Proposed Architecture
Analysis of the Computational Complexity
Vision Encoder.
Language Model.
Main Experimental Evaluation
Experimental Setup
Stage I.
Stage II.
Stage III.
Experimental Results
Assessing the Use of Fine Image Details
Conclusions
Data Mixture for Final Training Stage
...and 4 more sections

Figures (4)

Figure 1: Overview of the proposed architecture, where input images are split into regular non-overlapping patches that match the input resolution of a pre-trained ViT. Two sets of LoRA adapters are used to adjust the ViT to global and local sub-images, respectively, and a frozen LLM is conditioned on the concatenated vision representations through dense cross-attention layers.
Figure 2: An illustration for dense cross-attention layers. To condition the language model on visual inputs, we add new cross-attention layers between existing pre-trained and frozen language model layers. The keys and values for these layers are derived from vision features, while the queries come from language inputs. These layers are followed by dense feed-forward layers. The output matrices of both of these modules are initialized with values close to zero to maintain the integrity of the language model at initialization.
Figure 3: Architecture for high resolution multi-patch image encoding.
Figure 4: Attention scores for global patch tokens across data tertiles that reflect the relative dimensions of relevant image areas. Both graphs were calculated using the Pheye-x4 models. The average cross-attention score for the local patches is given by $1 - A_{G}$, where $A_{G}$ denotes the cross-attention scores for the global patch tokens.
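
The global/local split described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_global_and_local` is hypothetical, the image is assumed to be square and already resized to an exact multiple of the ViT input size, and block averaging stands in for whatever resizing method is actually used to produce the global view.

```python
import numpy as np

def split_global_and_local(image, patch_size, grid):
    """Split an (H, W, C) image into one global view and grid*grid local patches.

    Assumes H == W == patch_size * grid (hypothetical preprocessing step).
    """
    h, w, c = image.shape
    if h != patch_size * grid or w != patch_size * grid:
        raise ValueError("image side must equal patch_size * grid")

    # Global view: downsample the whole image to patch_size x patch_size by
    # averaging each grid x grid block (a stand-in for a proper resize).
    global_view = image.reshape(patch_size, grid, patch_size, grid, c).mean(axis=(1, 3))

    # Local patches: regular non-overlapping tiles at full resolution,
    # ordered row by row, each matching the encoder input size.
    local = image.reshape(grid, patch_size, grid, patch_size, c)
    local = local.transpose(0, 2, 1, 3, 4).reshape(grid * grid, patch_size, patch_size, c)
    return global_view, local

# Toy usage: a 4x4 single-channel image, 2x2 encoder input, 2x2 grid.
image = np.arange(16, dtype=np.float64).reshape(4, 4, 1)
global_view, local_patches = split_global_and_local(image, patch_size=2, grid=2)
print(global_view.shape, local_patches.shape)
```

In the architecture as captioned, the global view and the local patches would be encoded by the same frozen ViT with separate LoRA adapters, and the resulting features concatenated before the cross-attention layers; the sketch covers only the splitting step.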
