OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

Wei Chen, Zhiyuan Li, Shuo Xin

TL;DR

OmniVLM tackles on-device vision-language reasoning by 1) introducing a token-compression module that reduces image tokens from 729 to 81, enabling sub-billion parameter operation (968M) and efficient edge inference, and 2) applying a minimal-edit Direct Preference Optimization (DPO) training pipeline to improve output quality with minimal parameter updates. Built on a SigLIP-384 vision encoder and a Qwen2.5‑based language backbone, the approach uses a three-stage training regimen (pretraining, supervised fine-tuning, and DPO) to achieve competitive performance on ScienceQA, POPE, MM-VET, and MMMU while delivering substantial latency and throughput gains on laptops and mobile devices. Empirical results show a ~9× faster time-to-first-token and ~1.5× higher decoding speed on a laptop, and strong mobile performance, demonstrating practical viability for on-device multimodal AI. The work advances edge-deployable multimodal systems by combining aggressive yet effective token compression with targeted preference learning, and provides an accessible release for real-world use on consumer hardware.

Abstract

We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines like nanoLLAVA within a 968M-parameter footprint. Measured on the same laptop, OmniVLM achieves a 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) than nanoLLAVA, enabling efficient deployment on edge devices. The model weights are available on Hugging Face at https://huggingface.co/NexaAIDev/OmniVLM-968M, and inference examples can be found in Appendix B.
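
As a quick sanity check, the headline ratios in the abstract can be recomputed from the raw figures it quotes. The sketch below is illustrative arithmetic only: the constant and function names are hypothetical, and the inputs are simply the numbers stated above, not new measurements.

```python
# Figures quoted in the abstract (names here are illustrative, not from the paper).
TOKENS_BEFORE = 729            # visual tokens before compression
TOKENS_AFTER = 81              # visual tokens after compression
TTFT_OMNIVLM_S = 0.75          # time-to-first-token, seconds
TTFT_NANOLLAVA_S = 6.82
DECODE_OMNIVLM_TPS = 29.41     # decoding speed, tokens per second
DECODE_NANOLLAVA_TPS = 19.20


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a plain speedup or reduction factor."""
    return numerator / denominator


# Sequence-length reduction for the image tokens: 729 / 81.
token_reduction = ratio(TOKENS_BEFORE, TOKENS_AFTER)

# Lower latency is better, so the baseline goes in the numerator.
ttft_speedup = ratio(TTFT_NANOLLAVA_S, TTFT_OMNIVLM_S)

# Higher throughput is better, so OmniVLM goes in the numerator.
decode_speedup = ratio(DECODE_OMNIVLM_TPS, DECODE_NANOLLAVA_TPS)

print(f"token reduction: {token_reduction:.1f}x")
print(f"time-to-first-token speedup: {ttft_speedup:.1f}x")
print(f"decoding speedup: {decode_speedup:.1f}x")
```

Rounded to one decimal place, these reproduce the 9.1x and 1.5x figures reported above. The 9x drop in image-token count and the roughly 9x time-to-first-token gain are close in value, but the abstract does not state that one fully explains the other.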

Paper Structure

This paper contains 19 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: OmniVLM model architecture
  • Figure 2: Validation loss curves across different image token compression ratios, evaluated on a test dataset comprising around 500K text-image pairs. The comparison demonstrates the effect of token reduction from the baseline (729 tokens) to various compression levels (243, 81, and 9 tokens).
  • Figure 3: OmniVLM model benchmark.
  • Figure 4: Example of art description.
  • Figure 5: Example of complex scene analysis.
  • ...and 3 more figures