FastVLM: Efficient Vision Encoding for Vision Language Models

Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, Hadi Pouransari

TL;DR

FastVLM tackles the efficiency bottleneck of high-resolution vision-language models by introducing FastViTHD, a hybrid convolution-transformer vision encoder that produces far fewer visual tokens and significantly reduces encoding latency at high resolution. FastViTHD is paired with multi-scale features to boost performance while maintaining a favorable accuracy-latency trade-off across multiple LLMs and resolutions. The approach relies on resolution scaling rather than token pruning and improves the Pareto frontier of accuracy versus time-to-first-token (TTFT): in the LLaVA-1.5 setup TTFT improves by 3.2$\times$, and against LLaVA-OneVision at 1152$\times$1152 with the same 0.5B LLM, TTFT is 85$\times$ faster with a vision encoder that is 3.4$\times$ smaller. On-device benchmarks and extensive ablations show competitive results on text-rich tasks with fewer tokens and a leaner vision backbone, which makes high-resolution VLMs more practical to deploy.

Abstract

Scaling the input image resolution is essential for enhancing the performance of Vision Language Models (VLMs), particularly in text-rich image understanding tasks. However, popular visual encoders such as ViTs become inefficient at high resolutions due to the large number of tokens and high encoding latency caused by stacked self-attention layers. At different operational resolutions, the vision encoder of a VLM can be optimized along two axes: reducing encoding latency and minimizing the number of visual tokens passed to the LLM, thereby lowering overall latency. Based on a comprehensive efficiency analysis of the interplay between image resolution, vision latency, token count, and LLM size, we introduce FastVLM, a model that achieves an optimized trade-off between latency, model size, and accuracy. FastVLM incorporates FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. Unlike previous methods, FastVLM achieves the optimal balance between visual token count and image resolution solely by scaling the input image, eliminating the need for additional token pruning and simplifying the model design. In the LLaVA-1.5 setup, FastVLM achieves a 3.2$\times$ improvement in time-to-first-token (TTFT) while maintaining similar performance on VLM benchmarks compared to prior works. Compared to LLaVA-OneVision at the highest resolution (1152$\times$1152), FastVLM achieves better performance on key benchmarks like SeedBench, MMMU, and DocVQA, using the same 0.5B LLM, but with 85$\times$ faster TTFT and a vision encoder that is 3.4$\times$ smaller. Code and models are available at https://github.com/apple/ml-fastvlm.
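
To make the resolution-versus-token-count pressure concrete, the following is a minimal sketch, assuming a plain ViT that splits a square image into non-overlapping patches and ignoring any class token. The function name `vit_token_count` is hypothetical, and the patch size of 14 is chosen to match the ViT-L/14 encoder mentioned in the figure captions. It only illustrates the general scaling behaviour, not the FastVLM implementation.

```python
def vit_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (class token ignored)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    side = resolution // patch_size  # patches along one image side
    return side * side


# Token count grows quadratically with the image side length, and the number
# of pairwise self-attention interactions grows with the square of that count.
for res in (224, 336, 448, 672):
    n = vit_token_count(res)
    print(f"{res}x{res}: {n} tokens, {n * n} attention pairs per head per layer")
```

Doubling the side length quadruples the number of tokens, which is why both axes named in the abstract (encoder latency and the number of tokens handed to the LLM) degrade quickly for ViT-style encoders at high resolution.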

Paper Structure

This paper contains 23 sections, 6 figures, 16 tables.

Figures (6)

  • Figure 1: FastVLM is more than 3$\times$ faster than prior work. Comparison of commonly used vision encoders for VLMs with (a) Qwen2 \cite{qwen2} 0.5B LLM and (b) Vicuna 7B \cite{zheng2023judging} LLM. All the vision encoders are CLIP \cite{CLIP} pretrained. For a fair comparison, all models are trained using the LLaVA-1.5 \cite{liu2023improvedllava} setup with the vision encoders made trainable for resolution adaptation; see \ref{sec:experiments} for more details. Marker size for each model corresponds to the number of parameters of the vision encoder. The $x$-axis is the sum of vision encoder latency and LLM prefilling time. All models are benchmarked on an M1 MacBook Pro.
  • Figure 2: Overview of the FastVLM architecture. FastVLM consists of our novel vision encoder, FastViTHD, trained using the same setup as LLaVA. The FastViTHD architecture is designed for low latency at high resolution, by utilizing additional self-attention layers, and downsampling to generate 4$\times$ fewer tokens than FastViT, and 16$\times$ fewer tokens than ViT-L/14 at resolution 336.
  • Figure 3: Novel scaling strategy of FastViTHD lowers latency at various image resolutions. FastViT-Naive, a naive scaling of the FastViT architecture, and our proposed FastViTHD have the same number of parameters. ConvNeXt-L is provided for reference. All models are benchmarked on an M1 MacBook Pro and trained with the LLaVA-1.5 setup and Vicuna 7B. Note that the $y$-axis is in log scale.
  • Figure 4: FastViTHD improves the Pareto-optimal curve for accuracy versus time to first token compared with FastViT. Comparison of FastViT and FastViTHD backbones paired with Qwen2 \cite{qwen2} family (chat variant) LLMs of varying sizes and different image resolutions (annotated for each point). The Pareto-optimal curve is highlighted for the two vision backbones. Training setup is LLaVA-1.5. Note that the $x$-axis is in log scale.
  • Figure 5: Vision latency dominates at high resolution. Breakdown of FastVLM's time to first token for varying image resolutions. Vision encoder is FastViTHD and LLM is Qwen2-1.5B.
  • ...and 1 more figure
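
The captions above define time to first token as the sum of vision encoder latency and LLM prefilling time (Figure 1) and state token-reduction ratios relative to ViT-L/14 at resolution 336 (Figure 2). The sketch below works through that arithmetic. It assumes a patch-14 ViT with no class token and takes the caption's 4$\times$ and 16$\times$ ratios at face value; the exact token counts depend on the encoders' actual strides, which the captions do not state. The class `TtftBreakdown` and its field names are hypothetical, and the latency values in the usage lines are placeholders, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class TtftBreakdown:
    """Time to first token, split as in the Figure 1 caption."""
    vision_ms: float   # vision encoder latency
    prefill_ms: float  # LLM prefilling time

    @property
    def total_ms(self) -> float:
        return self.vision_ms + self.prefill_ms

    @property
    def vision_share(self) -> float:
        # Fraction of TTFT spent in the vision encoder (cf. Figure 5).
        return self.vision_ms / self.total_ms


# Token counts implied by the Figure 2 caption at resolution 336.
vit_l14_tokens = (336 // 14) ** 2        # 24 x 24 patches = 576 tokens
fastvithd_tokens = vit_l14_tokens // 16  # "16x fewer tokens than ViT-L/14"
fastvit_tokens = fastvithd_tokens * 4    # FastViTHD has "4x fewer than FastViT"

print(vit_l14_tokens, fastvit_tokens, fastvithd_tokens)

# Placeholder latencies, chosen only to exercise the class (NOT paper results).
example = TtftBreakdown(vision_ms=100.0, prefill_ms=300.0)
print(example.total_ms, example.vision_share)
```

Separating the two terms makes the trade-off in the captions explicit: a faster encoder lowers the first term, while emitting fewer visual tokens lowers the second, since prefilling cost grows with the number of tokens passed to the LLM.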