CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Miguel Carvalho, Helder Dias, Bruno Martins

TL;DR

CropVLM is a lightweight, external cropping module trained with reinforcement learning to focus VLMs dynamically on informative image regions, boosting fine-grained perception without modifying the target VLM. The method uses Group Relative Policy Optimization with task-aligned rewards (accuracy or log-likelihood), so training needs no bounding-box annotations, and it encodes crops as percentage-based bounding boxes. It shows consistent gains across multiple VLMs and on high-resolution, out-of-domain tasks, with lower computational overhead than full-resolution processing or alternative cropping strategies. The work highlights the practicality of a small external cropping model for improving detailed visual understanding: the target model stays frozen and intact, and the cropping model generalizes across target VLMs. It also discusses limitations, including multilingual generalization and potential biases in region selection, and calls for broader fairness-aware evaluation of cropping-based VLM pipelines.
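To make the group-relative part of the training signal concrete, the following is a minimal sketch of how advantages could be computed for a group of crops sampled for one question. It assumes the commonly used GRPO normalization (reward minus the group mean, divided by the group standard deviation). The function name `group_relative_advantages`, the `eps` constant, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled crops.

    Hypothetical helper: each reward would come from the frozen target VLM
    answering the question with one sampled crop (an accuracy score or an
    answer log-likelihood). Crops that score above the group mean receive a
    positive advantage, the others a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed constant) avoids division by zero when all rewards tie.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative accuracy rewards for four sampled crops of the same question.
example_rewards = [1.0, 0.0, 1.0, 0.0]
example_advantages = group_relative_advantages(example_rewards)
```

Because the advantages are computed relative to other crops sampled for the same question, no separate value model or ground-truth box is needed in this sketch. When every crop in a group earns the same reward, all advantages are zero and that group contributes no learning signal.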

Abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
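The external "zoom in" step can be pictured with a short sketch of a wrapper that leaves the target VLM untouched. Everything here is an assumption made for illustration: the callables `propose_crop` and `target_vlm`, the nested-list image representation, the choice to pass both the full image and the crop to the target model, and the rounding and clamping rules in `percent_box_to_pixels`.

```python
def percent_box_to_pixels(box_pct, width, height):
    """Map (x1, y1, x2, y2), given in percent of image size, to pixel bounds."""
    # Clamp each coordinate to the valid 0..100 range.
    x1, y1, x2, y2 = (min(max(v, 0.0), 100.0) for v in box_pct)
    # Order the corners so left <= right and top <= bottom.
    left_pct, right_pct = sorted((x1, x2))
    top_pct, bottom_pct = sorted((y1, y2))
    # Keep the start index inside the image and the crop at least 1 px wide/tall.
    left = min(int(left_pct / 100 * width), width - 1)
    top = min(int(top_pct / 100 * height), height - 1)
    right = max(int(right_pct / 100 * width), left + 1)
    bottom = max(int(bottom_pct / 100 * height), top + 1)
    return left, top, right, bottom


def zoom_and_answer(image, question, propose_crop, target_vlm):
    """Hypothetical wrapper: crop first, then query the frozen target VLM.

    image        -- list of pixel rows (a stand-in for a real image object)
    propose_crop -- callable(image, question) -> percent box (x1, y1, x2, y2)
    target_vlm   -- callable(list_of_images, question) -> answer
    """
    height, width = len(image), len(image[0])
    box_pct = propose_crop(image, question)
    left, top, right, bottom = percent_box_to_pixels(box_pct, width, height)
    crop = [row[left:right] for row in image[top:bottom]]
    # Assumption: the target model sees the full image plus the zoomed region.
    return target_vlm([image, crop], question)
```

Since the wrapper only calls `target_vlm` as a black box, the same cropping model could in principle sit in front of an open-source model or a proprietary API, which is the pairing the abstract describes.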

Paper Structure

This paper contains 31 sections, 1 equation, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Overview of CropVLM paired with LLaVA. CropVLM dynamically selects informative image regions to boost fine-grained perception while keeping the target VLM frozen.
  • Figure 2: The overall CropVLM training procedure. The orange and purple lines represent training with an accuracy-based reward or with a log-likelihood reward, respectively.
  • Figure 3: TextVQA performance across multiple bounding box expansion factors, using human annotations, with SmolVLM at 512$\times$512 and 2048$\times$2048 input resolutions.
  • Figure 4: Qualitative examples from the V* Benchmark, where the first 6 cases are successes and the last 2 are failures. Next to each image, we present the question alongside responses from GPT 4.1 nano alone and from GPT 4.1 nano paired with the CropVLM model that accepts 2048$\times$2048 input images and was trained using log-likelihood rewards. The red bounding box denotes the region of interest proposed by CropVLM.
  • Figure 5: Qualitative examples from TextVQA, where the first 6 cases are successes and the last 2 are failures. Next to each image, we present the question alongside responses from GPT 4.1 nano alone and from GPT 4.1 nano paired with the CropVLM model that accepts 2048$\times$2048 input images and was trained using log-likelihood rewards. The red bounding box denotes the region of interest proposed by CropVLM.