Empirical Recipes for Efficient and Compact Vision-Language Models

Jiabo Huang, Zhizhong Li, Sina Sajadmanesh, Weiming Zhuang, Lingjuan Lyu

Abstract

Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.

Paper Structure

This paper contains 34 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: ArgusVLM excels at both performance and inference efficiency with a faster time-to-first-token (TTFT) compared to existing compact VLMs. Each bubble represents a model variant, where the area indicates model size.
  • Figure 2: ArgusVLM-2B achieves strong performance on image understanding and captioning tasks across five benchmarks compared with leading vision-language models such as QwenVL [bai2025qwen2] and InternVL [zhu2025internvl3].
  • Figure 3: Flame graph from profiling VLM inference with austin [austin] to identify CPU-side bottlenecks. The visualization separates processes for easier analysis: Process 2953917 performs multimodal preprocessing, while Process 2954325 runs model inference. Best viewed digitally with zoom.
  • Figure 4: GPU profiling of SmolVLM-256M inference under vLLM with (a) bfloat16 and (b) W8A8 quantization. The timelines show GPU kernels within a transformer layer during decoding, with the average execution time ($\mu$s) annotated below each kernel. Despite reduced precision, the quantized model is slower.
  • Figure 5: Overview of the ArgusVLM architecture. ArgusVLM consists of a Vision Transformer (ViT), a single-layer MLP, and a large language model (LLM). Given an input image, we first resize it to the closest aspect ratio from a predefined set, then split it into square tiles that are resized to the ViT input resolution; we also generate a global thumbnail to provide scene-level context. Each tile (and the thumbnail) is independently encoded by the ViT into patch tokens, which are then compressed by concatenating features from spatially adjacent patches along the channel dimension. In parallel, the user instruction is tokenized into text tokens. Finally, the visual and text tokens are concatenated and fed to the LLM, which generates the response autoregressively.
  • ...and 2 more figures
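
The tiling and token-compression steps described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the tile budget of 6, the tie-breaking rule (fewer tiles wins), and the 32×32 patch grid per tile are all hypothetical choices made for illustration. The sketch picks the tile grid whose aspect ratio is closest to the image's, merges each 2×2 block of adjacent patch features by concatenating along the channel dimension, and counts the resulting visual tokens including one thumbnail.

```python
from typing import List, Tuple

Feature = List[float]


def candidate_grids(max_tiles: int) -> List[Tuple[int, int]]:
    """Enumerate (cols, rows) tile grids using at most max_tiles tiles."""
    return [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if c * r <= max_tiles
    ]


def closest_grid(width: int, height: int, max_tiles: int = 6) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's.

    Assumption: ties are broken in favor of fewer tiles.
    """
    target = width / height
    return min(
        candidate_grids(max_tiles),
        key=lambda g: (abs(g[0] / g[1] - target), g[0] * g[1]),
    )


def merge_adjacent_patches(grid: List[List[Feature]], s: int = 2) -> List[List[Feature]]:
    """Concatenate each s x s block of patch features along the channel axis.

    An (H, W, C) grid becomes (H/s, W/s, C*s*s), so the token count drops by s*s.
    """
    h, w = len(grid), len(grid[0])
    if h % s or w % s:
        raise ValueError("grid height and width must be divisible by s")
    merged = []
    for i in range(0, h, s):
        row = []
        for j in range(0, w, s):
            feat: Feature = []
            for di in range(s):
                for dj in range(s):
                    feat.extend(grid[i + di][j + dj])
            row.append(feat)
        merged.append(row)
    return merged


def visual_token_count(width: int, height: int, patches_per_side: int,
                       s: int = 2, max_tiles: int = 6) -> int:
    """Tokens from all tiles plus one global thumbnail after merging."""
    cols, rows = closest_grid(width, height, max_tiles)
    per_tile = (patches_per_side // s) ** 2
    return (cols * rows + 1) * per_tile


if __name__ == "__main__":
    print(closest_grid(800, 600))            # grid chosen for a 4:3 image
    print(visual_token_count(800, 600, 32))  # tokens for hypothetical 32x32 patches per tile
```

Under these assumptions, merging with s = 2 cuts the number of tokens per tile by a factor of four while quadrupling the channel width, which is why the projection MLP that follows must accept the wider features.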