Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

Abdelrahman Shaker, Muhammad Maaz, Chenhui Gou, Hamid Rezatofighi, Salman Khan, Fahad Shahbaz Khan

TL;DR

Mobile-VideoGPT tackles the inefficiency of video-language models by delivering real-time video understanding with under 1B parameters, using a dual-encoder backbone, an Efficient Token Projector (ET-Proj), and a small language model. The method introduces an Attention-Based Frame Scoring module to select key frames and uses ET-Proj to compress and fuse visual tokens into a unified vision-language space. Mobile-VideoGPT-0.5B achieves up to 46 tokens/s throughput and outperforms competitive 0.5B-parameter baselines by about 6 points on average across six benchmarks, with roughly 40% fewer parameters and more than 2x higher throughput. These results suggest strong practical potential for edge deployment and real-time applications; code and models are publicly available at https://github.com/Amshaker/Mobile-VideoGPT.

Abstract

Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key frames, along with an efficient token projector that prunes redundant visual tokens while preserving essential contextual cues. We evaluate our model on six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
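
To make the key-frame selection idea concrete, here is a minimal sketch of one way attention-based frame scoring could work. It is not taken from the released code. It assumes each frame is already summarized by a single feature vector, uses those vectors directly as queries and keys (no learned projections), and scores a frame by the total attention it receives from all frames. The names `score_frames` and `select_top_k` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_frames(frame_feats):
    """Score each frame by the attention it receives from every frame.

    Assumption: scaled dot-product attention computed directly on the
    frame features; the actual model may use learned projections.
    """
    dim = len(frame_feats[0])
    scale = 1.0 / math.sqrt(dim)
    scores = [0.0] * len(frame_feats)
    for query in frame_feats:
        logits = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in frame_feats
        ]
        # Each query distributes one unit of attention over all frames.
        for j, weight in enumerate(softmax(logits)):
            scores[j] += weight
    return scores


def select_top_k(frame_feats, k):
    """Return indices of the k highest-scoring frames in temporal order."""
    scores = score_frames(frame_feats)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Calling `select_top_k(frame_feats, k)` on a list of per-frame feature vectors returns the indices of the selected frames sorted by time, so a downstream video encoder would still see them in their original order.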

Paper Structure

This paper contains 17 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Performance comparison of Mobile-VideoGPT with competitive SoTA models across multiple video benchmarks. Mobile-VideoGPT demonstrates better performance, fewer parameters, and significantly higher throughput.
  • Figure 2: Overview of the proposed Mobile-VideoGPT. The pipeline begins by extracting spatial features from all video frames with an efficient image encoder. These features then flow into an attention-based frame scoring mechanism, which identifies the K most salient frames. Next, an efficient video encoder processes these selected frames to capture temporal dynamics. The resulting spatial and temporal representations are projected into a unified vision-language space using an efficient token projector (ET-Proj). Finally, the projected tokens are fused, and a small language model uses these tokens to generate comprehensive responses to video-based questions.
  • Figure 3: Overview of the Mobile-VideoGPT training strategy and the architecture of the Efficient Token Projector (ET-Proj). (a) Stage 1 focuses on pre-training the image token projector with an efficient image encoder, using ET-Proj to map image features to tokens. (b) Stage 2 extends this approach to videos by pre-training the video token projector with an efficient video encoder. (c) Stage 3 introduces instruction tuning, where both the image and video token projectors are learnable, and LoRA fine-tuning is applied to the small language model. (d) Details of the ET-Proj architecture show a lightweight design comprising a feedforward network (FFN) to refine token embeddings for desired representational capacity, a token reduction step leveraging global average pooling, and a positional encoding module with a skip connection to retain spatial and temporal context.
  • Figure 4: Qualitative comparison between the proposed Mobile-VideoGPT-0.5B, LLaVA-OneVision-0.5B, and LLaVA-Mini-8B, highlighting both video comprehension quality and speed, measured as latency and throughput (tokens per second). Additional qualitative examples for our model are presented in the supplementary material.
  • Figure 5: Qualitative results of Mobile-VideoGPT-0.5B. The model shows strong detailed video comprehension, handling both long and short open-ended questions while achieving real-time throughput (tokens per second).
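
The ET-Proj components named in the Figure 3(d) caption (an FFN, token reduction by average pooling, and a positional encoding with a skip connection) can be illustrated with the rough sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the order of the steps, the GELU activation, pooling over fixed-size groups of consecutive tokens, and the caller-supplied `pos_fn` standing in for the positional encoding module are all assumptions, and every function name is hypothetical.

```python
import math


def linear(tokens, weight, bias):
    """Apply y = x @ weight + bias to each token (weight is d_in x d_out)."""
    return [
        [sum(x * w for x, w in zip(tok, col)) + b
         for col, b in zip(zip(*weight), bias)]
        for tok in tokens
    ]


def gelu(x):
    """Exact GELU activation (assumed; the caption does not name one)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn(tokens, w1, b1, w2, b2):
    """Two-layer feedforward network that refines token embeddings."""
    hidden = [[gelu(v) for v in tok] for tok in linear(tokens, w1, b1)]
    return linear(hidden, w2, b2)


def avg_pool_tokens(tokens, group):
    """Reduce the token count by averaging runs of `group` tokens."""
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def add_positional_skip(tokens, pos_fn):
    """Skip connection around the positional module: tokens + pos_fn(tokens)."""
    pos = pos_fn(tokens)
    return [[t + p for t, p in zip(tok, ptok)] for tok, ptok in zip(tokens, pos)]


def et_proj_sketch(tokens, w1, b1, w2, b2, group, pos_fn):
    """FFN, then token reduction, then positional encoding with a skip."""
    refined = ffn(tokens, w1, b1, w2, b2)
    reduced = avg_pool_tokens(refined, group)
    return add_positional_skip(reduced, pos_fn)
```

With `group=2`, for example, a sequence of N visual tokens comes out as roughly N/2 tokens in the output embedding width, which is the kind of token compression the caption describes.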