Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou

TL;DR

Dragonfly introduces a multi-resolution zoom-in encoding strategy for vision-language models, combining low-, medium-, and high-resolution crops with mean-pooling to control token growth. This approach preserves fine-grained details (e.g., text in charts, medical imagery) while keeping the context length tractable, yielding strong general-domain performance and, with Dragonfly-Med, leading results on biomedical benchmarks. The work demonstrates that higher-resolution features and sub-crop featurization offer tangible gains over fixed-resolution ViTs, and provides a practical training pipeline and dataset curation to support both general and biomedical VLM tasks. Overall, the paper argues for moving beyond fixed-resolution architectures toward native, multi-scale encoding to enhance visual grounding and multimodal reasoning.

Abstract

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical model, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and state-of-the-art results across the majority of image captioning tasks. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
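
As a rough illustration of the mean-pooling step, the sketch below averages non-overlapping windows of one sub-crop's token grid. The Figure 2 caption states that each sub-crop is reduced to 36 tokens; everything else here is an assumption made for illustration: the 24x24 input grid, the 6x6 output grid, the row-major token order, and the function name `mean_pool_grid`. It is not the authors' implementation.

```python
from typing import List

Token = List[float]


def mean_pool_grid(tokens: List[Token], grid: int, out_grid: int) -> List[Token]:
    """Average non-overlapping windows of a grid x grid token map.

    `tokens` holds grid*grid embeddings in row-major order (an assumption).
    Returns out_grid*out_grid pooled embeddings, also in row-major order.
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid*grid tokens in row-major order")
    if grid % out_grid != 0:
        raise ValueError("grid must be divisible by out_grid")

    win = grid // out_grid  # side length of each pooling window
    dim = len(tokens[0])    # embedding dimension
    pooled: List[Token] = []
    for out_r in range(out_grid):
        for out_c in range(out_grid):
            acc = [0.0] * dim
            for r in range(out_r * win, (out_r + 1) * win):
                for c in range(out_c * win, (out_c + 1) * win):
                    tok = tokens[r * grid + c]
                    for d in range(dim):
                        acc[d] += tok[d]
            pooled.append([a / (win * win) for a in acc])
    return pooled


# Toy input: a hypothetical 24x24 grid whose 2-dim "embedding" is (row, col).
grid = 24
tokens = [[float(r), float(c)] for r in range(grid) for c in range(grid)]
pooled = mean_pool_grid(tokens, grid=24, out_grid=6)
print(len(pooled))  # 36
print(pooled[0])    # [1.5, 1.5]
```

With these assumed sizes, each sub-crop contributes 36 tokens instead of 576, which is what keeps the total token count manageable when many sub-crops are encoded.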

Paper Structure

This paper contains 31 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Examples generated by Dragonfly, showcasing its diverse capabilities, including world knowledge and humor, multi-turn question-answering, OCR, and chart understanding.
  • Figure 2: Overview of our proposed Dragonfly framework. The original image is resized into low, medium, and high resolutions. The medium- and high-resolution images are divided into crops matching the encoder's training resolution. All sub-crops pass through a shared vision encoder to produce visual tokens. The projection layer then maps the visual tokens to the language space. Afterward, the mean-pooling layer reduces the embeddings from each sub-crop into 36 tokens.
  • Figure 3: Examples of biomedical VQA. The figure shows three questions along with their gold standard answers and the corresponding responses from the Dragonfly-Med model.
  • Figure 4: Ratio of the maximum resolution of our high-resolution image to the native resolution of the original image. We computed this ratio over our entire training dataset, which comprises data from multiple sources and tasks. First, we matched each image to one of the aspect ratios using the algorithm described in the experimental setup subsection. Then, we calculated the ratio of the longest dimension of the maximum-resolution image to the longest dimension of the image at its native resolution. The plot shows that 65% of the images in our training cohort are zoomed in by at least 4x relative to their native resolution.
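
The zoom ratio described in the Figure 4 caption can be sketched as follows. This is a minimal example under stated assumptions: the image sizes are hypothetical, the helper names `zoom_ratio` and `fraction_at_least` are invented for illustration, and the aspect-ratio matching step is assumed to have already produced the maximum-resolution size for each image.

```python
from typing import Iterable, List, Tuple

Size = Tuple[int, int]  # (width, height) in pixels


def zoom_ratio(native: Size, max_res: Size) -> float:
    """Longest side of the max-resolution image over the longest native side."""
    native_long = max(native)
    if native_long <= 0:
        raise ValueError("native size must be positive")
    return max(max_res) / native_long


def fraction_at_least(ratios: Iterable[float], threshold: float = 4.0) -> float:
    """Share of images whose zoom ratio meets or exceeds the threshold."""
    values: List[float] = list(ratios)
    if not values:
        raise ValueError("no ratios given")
    return sum(r >= threshold for r in values) / len(values)


# Hypothetical (native, max-resolution) size pairs, not taken from the paper.
pairs = [
    ((500, 375), (2016, 1344)),
    ((1024, 768), (2016, 1344)),
    ((336, 336), (2016, 2016)),
    ((400, 600), (1344, 2016)),
]
ratios = [zoom_ratio(native, max_res) for native, max_res in pairs]
share = fraction_at_least(ratios, threshold=4.0)
print(share)  # 0.5
```

Applied to a full training set, the same share at a threshold of 4.0 would correspond to the 65% figure quoted in the caption.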