OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference

Wei Chen; Zhiyuan Li; Shuo Xin

TL;DR

OmniVLM tackles on-device vision-language reasoning by 1) introducing a token-compression module that reduces image tokens from 729 to 81, enabling sub-billion parameter operation (968M) and efficient edge inference, and 2) applying a minimal-edit Direct Preference Optimization (DPO) training pipeline to improve output quality with minimal parameter updates. Built on a SigLIP-384 vision encoder and a Qwen2.5‑based language backbone, the approach uses a three-stage training regimen (pretraining, supervised fine-tuning, and DPO) to achieve competitive performance on ScienceQA, POPE, MM-VET, and MMMU while delivering substantial latency and throughput gains on laptops and mobile devices. Empirical results show a ~9× faster time-to-first-token and ~1.5× higher decoding speed on a laptop, and strong mobile performance, demonstrating practical viability for on-device multimodal AI. The work advances edge-deployable multimodal systems by combining aggressive yet effective token compression with targeted preference learning, and provides an accessible release for real-world use on consumer hardware.
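The 729-to-81 reduction is a 9× cut in visual sequence length (729 = 27 × 27 and 81 = 9 × 9). This summary does not say how the compression module works, so the sketch below shows just one way such a reduction could be done: concatenating every 9 consecutive token vectors into a single wider vector. The function name `compress_tokens`, the grouping scheme, and the toy hidden size of 4 are all assumptions made for illustration, not details taken from the paper.

```python
def compress_tokens(tokens, group=9):
    """Merge every `group` consecutive token vectors into one longer vector.

    Hypothetical illustration: the grouping-by-concatenation scheme is an
    assumption, not the mechanism described by the source.
    """
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    compressed = []
    for start in range(0, len(tokens), group):
        merged = []
        for vec in tokens[start:start + group]:
            merged.extend(vec)  # concatenate along the feature dimension
        compressed.append(merged)
    return compressed


# 729 toy token vectors with a made-up hidden size of 4.
tokens = [[float(i)] * 4 for i in range(729)]
compressed = compress_tokens(tokens)

# Sequence length shrinks 9x while each token becomes 9x wider.
print(len(compressed), len(compressed[0]))
```

Under this scheme no feature values are discarded: the sequence gets shorter while each token gets wider, so a projection layer would normally follow to map the wider vectors back to the language model's embedding size.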

Abstract

We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces the visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks, including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines such as nanoLLAVA within a 968M-parameter footprint. Measured on the same laptop, OmniVLM achieves a 9.1x faster time-to-first-token (0.75s vs 6.82s) and a 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) than nanoLLAVA, enabling efficient deployment on edge devices. The model weights are available on Hugging Face at https://huggingface.co/NexaAIDev/OmniVLM-968M, and inference examples can be found in Appendix B.
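The speedup factors quoted in the abstract follow directly from the raw laptop measurements. The short check below recomputes them; the variable names are illustrative, and the four numbers are the ones reported above.

```python
# Laptop measurements as reported in the abstract.
ttft_omnivlm, ttft_nanollava = 0.75, 6.82        # seconds to first token
decode_omnivlm, decode_nanollava = 29.41, 19.20  # decoded tokens per second

# Lower is better for latency, so the baseline goes in the numerator.
ttft_speedup = ttft_nanollava / ttft_omnivlm
# Higher is better for throughput, so OmniVLM goes in the numerator.
decode_speedup = decode_omnivlm / decode_nanollava

print(f"{ttft_speedup:.1f}x faster TTFT, {decode_speedup:.1f}x faster decoding")
```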
