Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo

TL;DR

Problem: streaming inference for multimodal LLMs is hindered by quadratic attention costs, large KV caches, and memory limits on edge devices. Approach: Inf-MLLM builds on an observed attention pattern, "attention saddles", and uses it to drive retrieval-window KV eviction with attention bias, plus length extrapolation to extend context without fine-tuning. Contributions: a size-constrained KV cache eviction mechanism, dynamic updating via attention bias, and demonstrations across text and video benchmarks showing improved perplexity, memory efficiency, and long-term memory on a single GPU and edge devices. Impact: enables practical deployment of multimodal LLMs in streaming settings, including long conversations and long video streams, on resource-constrained hardware.

Abstract

Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal ability and are widely used in real-world applications such as GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long context requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework that enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of an attention pattern in both LLMs and MLLMs called "attention saddles". Thanks to this newly discovered pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits better streaming reasoning quality than existing methods such as StreamingLLM and a 2x speedup over H2O.
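
The idea of a size-constrained KV cache that keeps recent tokens plus relevant tokens can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `select_kv_indices`, the parameters `cache_size` and `recent_window`, and the use of an accumulated attention score as the relevance signal are all assumptions made for illustration, and the attention bias mechanism is omitted entirely.

```python
from typing import List, Sequence


def select_kv_indices(
    attn_scores: Sequence[float],
    cache_size: int,
    recent_window: int,
) -> List[int]:
    """Return the sorted positions of cached tokens to keep.

    attn_scores[i] is a hypothetical relevance score for cached token i
    (for example, attention mass it received from recent queries).
    The actual criterion used by Inf-MLLM may differ.
    """
    if cache_size <= 0 or not 0 <= recent_window <= cache_size:
        raise ValueError("need cache_size > 0 and 0 <= recent_window <= cache_size")

    n = len(attn_scores)
    if n <= cache_size:
        return list(range(n))  # cache is within budget, nothing to evict

    # Always keep the most recent tokens.
    recent_start = n - recent_window
    recent = list(range(recent_start, n))

    # Spend the remaining budget on the highest-scoring older tokens.
    budget = cache_size - recent_window
    older = sorted(range(recent_start), key=lambda i: attn_scores[i], reverse=True)
    relevant = older[:budget]

    # Restore positional order so the kept K/V rows stay in sequence.
    return sorted(relevant + recent)


scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3, 0.1, 0.4]
print(select_kv_indices(scores, cache_size=5, recent_window=3))  # [0, 3, 5, 6, 7]
```

In this sketch the cache never exceeds `cache_size` entries regardless of stream length, which is the property that keeps memory bounded during streaming inference; the returned indices would be used to gather the surviving rows from the key and value tensors.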

Paper Structure (19 sections, 7 figures, 4 tables, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of the streaming inference process. The bottom figure shows that Inf-MLLM enables existing MLLMs to handle streams of text and video without running out of memory (OOM) while maintaining high-quality token generation.
  • Figure 2: Attention maps with typical patterns. Several layers of the MLLM Chat-UniVi-7B are shown as examples.
  • Figure 3: Illustration of KV cache eviction, which is triggered when a new prompt arrives during streaming inference.
  • Figure 4: Illustration of attention bias, which adjusts the distribution of attention scores during streaming inference.
  • Figure 5: LLM perplexity comparison on the WikiText-103 dataset with different context lengths.
  • ...and 2 more figures