
LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments

Jin Huang, Yuchao Jin, Le An, Josh Park

TL;DR

LiteVLM targets real-time vision-language inference on embedded hardware by combining patch-view filtering, token-level pruning, and speculative decoding to cut end-to-end latency. The method reduces computation across the vision encoder, LLM prefill, and autoregressive decoding stages, with FP8 quantization further boosting throughput. Evaluations on NVIDIA DRIVE Thor with DriveLM show speedups of up to $3.2\times$ while maintaining task performance, enabling deployment in robotics and autonomous systems. The work offers a practical blueprint for edge-optimized VLMs and suggests avenues for quantization-aware training and broader VQA applicability.

Abstract

This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce the input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, our pipeline achieves a $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate that our pipeline is a viable solution for enabling real-time VLM deployment in resource-constrained environments.
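
To make the two headline numbers easier to compare, the short sketch below converts each speedup into the fraction of baseline latency that remains and derives the extra factor attributable to FP8. It assumes both the $2.5\times$ and $3.2\times$ figures are measured against the same unoptimized baseline, which the abstract implies but does not state outright. The function and constant names are illustrative and do not come from the paper.

```python
# Back-of-the-envelope arithmetic on the speedups quoted in the abstract.
# Assumption: both speedups are relative to the same baseline latency.

PIPELINE_SPEEDUP = 2.5      # patch selection + token selection + speculative decoding
PIPELINE_FP8_SPEEDUP = 3.2  # same pipeline with FP8 post-training quantization


def relative_latency(speedup: float) -> float:
    """Fraction of the baseline latency that remains after a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Extra speedup contributed by FP8 on top of the already optimized pipeline.
fp8_incremental_speedup = PIPELINE_FP8_SPEEDUP / PIPELINE_SPEEDUP

if __name__ == "__main__":
    print(f"Pipeline latency vs. baseline:     {relative_latency(PIPELINE_SPEEDUP):.2%}")
    print(f"Pipeline+FP8 latency vs. baseline: {relative_latency(PIPELINE_FP8_SPEEDUP):.2%}")
    print(f"Incremental FP8 speedup:           {fp8_incremental_speedup:.2f}x")
```

Under that assumption, the optimized pipeline runs in 40% of the baseline latency, FP8 brings this to roughly 31%, and quantization alone accounts for an additional factor of about $1.28\times$.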

Paper Structure

This paper contains 11 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: The proposed framework builds upon a Vision-Language Model by introducing two novel modules: the Patch Selection Module and the Token Selection Module (highlighted in orange). We also incorporate a Speculative Decoding Head (highlighted in blue) to accelerate the decoding process. Together, these components enable efficient token generation for real-time applications.
  • Figure 2: Average end-to-end latency of our proposed pipeline compared to baseline 2B VLMs across different configurations.
  • Figure 3: Latency of the different stages of the 2B VLM under FP16 and FP8 execution on the NVIDIA DRIVE Thor platform.