CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Miguel Carvalho, Helder Dias, Bruno Martins
TL;DR
CropVLM presents a lightweight, external cropping module trained with reinforcement learning to dynamically focus VLMs on informative image regions, boosting fine-grained perception without modifying the target VLM. The method uses Group Relative Policy Optimization with task-aligned rewards (accuracy or log-likelihood) and crops encoded as percentage-based bounding boxes, enabling bounding-box-free supervision. It demonstrates consistent gains across multiple VLMs and high-resolution, out-of-domain tasks, while maintaining low computational overhead compared to full-resolution processing or alternative cropping strategies. The work highlights the practicality of external, low-parameter cropping to improve detailed visual understanding while preserving the integrity of the frozen target model and offering robust generalization. It also discusses limitations, including multilingual generalization and potential biases in region selection, calling for broader fairness-aware evaluation in cropping-based VLM pipelines.
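As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes the rewards of several crop proposals sampled for the same image and question. It is a minimal sketch, not the authors' training code: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, and the rewards stand in for whichever task-aligned signal (accuracy or log-likelihood) is used.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per sampled crop, normalized within the group.

    `rewards` holds the task-aligned reward of each crop proposal sampled
    for the same (image, question) pair. The name, `eps`, and the choice of
    population standard deviation are illustrative assumptions.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Crops that beat the group average get a positive advantage,
    # crops that fall below it get a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four crops sampled for one question; two led to a correct answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

In this example the two successful crops receive advantages of approximately +1 and the two unsuccessful ones approximately -1. When every crop in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.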
Abstract
Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM, an external, low-cost method for boosting performance that enables VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without human-labeled bounding boxes as a supervision signal and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably on benchmarks that are out-of-domain for the target VLM. Because the target VLM is neither modified nor fine-tuned, the approach avoids catastrophic forgetting.
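The "zoom in" step can be pictured as mapping a percentage-based bounding box, as mentioned in the TL;DR, onto pixel coordinates of the original image. The following is a minimal sketch under stated assumptions: the `(x1, y1, x2, y2)` ordering, the 0 to 100 range, the clamping behaviour, and the helper name `percent_box_to_pixels` are hypothetical and are not taken from the paper.

```python
import math


def percent_box_to_pixels(box, width, height):
    """Convert a percentage box to a pixel box (left, top, right, bottom).

    Assumed format (not confirmed by the paper): box = (x1, y1, x2, y2),
    each value a percentage in [0, 100] of the image width or height.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Clamp out-of-range predictions into [0, 100].
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 100.0) for v in box)
    # Tolerate swapped corners.
    x1, x2 = min(x1, x2), max(x1, x2)
    y1, y2 = min(y1, y2), max(y1, y2)
    # Round outward so the crop never cuts into the requested region,
    # and keep at least one pixel in each dimension.
    left = min(math.floor(width * x1 / 100.0), width - 1)
    top = min(math.floor(height * y1 / 100.0), height - 1)
    right = min(max(math.ceil(width * x2 / 100.0), left + 1), width)
    bottom = min(max(math.ceil(height * y2 / 100.0), top + 1), height)
    return left, top, right, bottom


# A box covering the central half of the width of a 2000x1000 image.
print(percent_box_to_pixels((25, 10, 75, 60), 2000, 1000))
```

The returned tuple follows the (left, upper, right, lower) ordering that common imaging libraries such as Pillow expect for cropping, so the resulting region could then be passed to the target VLM alongside the original image. Expressing boxes as percentages keeps the cropping output independent of the input resolution.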
