Efficient Architectures for High Resolution Vision-Language Models
Miguel Carvalho, Bruno Martins
TL;DR
Efficient high-resolution Vision-Language Models (VLMs) are needed to capture fine-grained details in images, including scene-text. The paper introduces Pheye, a parameter-efficient architecture that keeps a language model and a CLIP-based vision encoder frozen, and adds dense cross-attention and dual LoRA adapters to process high-resolution images as global and local patches. In experiments, Pheye achieves competitive results on TextVQA and related tasks with far fewer trainable parameters. Higher image resolution yields larger gains on fine-detail tasks, improving scene-text performance without OCR tokens in the prompt. The work demonstrates a practical path to deploying high-resolution VLMs on resource-constrained hardware and suggests future directions such as alternative vision encoders, synthetic data for scene-text, and multilingual capabilities.
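To make the idea of global and local patches concrete, the following is a minimal sketch and not the authors' implementation. It assumes a patch size of 224 pixels, image dimensions that are exact multiples of the patch size, and a crude strided subsample as the global view; the function name `split_global_local` is hypothetical.

```python
import numpy as np


def split_global_local(image: np.ndarray, patch: int = 224):
    """Split an H x W x C image into one global view and a grid of local patches.

    Assumptions (not from the paper): H and W are multiples of `patch`,
    and the global view is a nearest-neighbour style subsample rather
    than a properly filtered resize.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of patch")
    rows, cols = h // patch, w // patch

    # Global view: keep every rows-th / cols-th pixel so the whole image
    # fits into a single patch-sized input.
    global_view = image[::rows, ::cols]

    # Local patches: tile the full-resolution image in row-major order.
    local = (
        image.reshape(rows, patch, cols, patch, c)
        .swapaxes(1, 2)
        .reshape(rows * cols, patch, patch, c)
    )
    return global_view, local


if __name__ == "__main__":
    img = np.zeros((448, 672, 3), dtype=np.uint8)
    g, l = split_global_local(img)
    print(g.shape, l.shape)
```

In a setup like this, the global view preserves overall layout at low resolution while the local patches retain the fine detail, such as small text, that a single downsampled image would lose.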
Abstract
Vision-Language Models (VLMs) have recently experienced significant advancements. However, challenges persist in the accurate recognition of fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene-text.
