InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang

TL;DR

InfiniteVL addresses the long-context limitation of Vision-Language Models by combining Sliding Window Attention with Gated DeltaNet, giving linear-time computation and a constant-size memory cache for unlimited multimodal input. It pairs this architecture with a three-stage training pipeline (distillation pretraining, instruction tuning, and long-sequence SFT) to reach performance competitive with Transformer-based VLMs while delivering substantial inference speedups at constant memory. Despite training on limited data, the model shows robust long-term memory on streaming video and long-context tasks and sustains real-time prefill at 24 FPS. The work offers a deployment-friendly path to high-capacity, long-context VLMs suitable for edge devices and streaming applications without external memory modules.

Abstract

Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving its long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
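
To make the "ever-growing KV cache" versus "constant memory footprint" contrast concrete, the following is a rough back-of-the-envelope sketch, not the authors' measurement code. Every value in `CacheConfig` (layer count, head sizes, window length, state size) is a hypothetical placeholder rather than InfiniteVL's actual configuration. Only the 24 FPS rate and the 274 tokens per frame are taken from the text (the latter from the Figure 4 caption), and the one-SWA-to-three-Gated-DeltaNet layer ratio follows the Figure 2 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    # All values are hypothetical placeholders, not InfiniteVL's real settings.
    num_layers: int = 36       # total decoder layers (assumed)
    swa_layers: int = 9        # one SWA layer per three Gated DeltaNet layers
    num_kv_heads: int = 2      # key/value heads per attention layer (assumed)
    head_dim: int = 128        # per-head dimension (assumed)
    window: int = 8192         # SWA window length in tokens (assumed)
    state_heads: int = 16      # heads per linear-attention layer (assumed)
    state_dim: int = 128       # each head keeps a state_dim x state_dim state
    bytes_per_value: int = 2   # fp16 / bf16 storage


def full_attention_cache_bytes(cfg: CacheConfig, seq_len: int) -> int:
    """KV cache when every layer stores keys and values for every token."""
    per_token = 2 * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return cfg.num_layers * per_token * seq_len


def hybrid_cache_bytes(cfg: CacheConfig, seq_len: int) -> int:
    """Cache of a hybrid stack: windowed KV plus fixed-size recurrent states."""
    per_token = 2 * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value
    # SWA layers never keep more than `window` tokens.
    swa = cfg.swa_layers * per_token * min(seq_len, cfg.window)
    # The remaining layers keep one matrix state per head, independent of length.
    linear_layers = cfg.num_layers - cfg.swa_layers
    state = (linear_layers * cfg.state_heads
             * cfg.state_dim * cfg.state_dim * cfg.bytes_per_value)
    return swa + state


def streamed_tokens(seconds: int, fps: int = 24, tokens_per_frame: int = 274) -> int:
    """Visual tokens produced by a stream of the given duration."""
    return seconds * fps * tokens_per_frame


cfg = CacheConfig()
for seconds in (60, 600, 3600):
    n = streamed_tokens(seconds)
    print(seconds, n, full_attention_cache_bytes(cfg, n), hybrid_cache_bytes(cfg, n))
```

Under these assumptions the full-attention cache grows linearly with stream duration, while the hybrid cache stops growing once the stream is longer than the window. The absolute byte counts depend entirely on the placeholder values and should not be read as reported results.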

Paper Structure

This paper contains 31 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Efficiency and performance of InfiniteVL. Left: Under comparable performance, InfiniteVL significantly improves single-GPU training throughput per day, streaming FPS, inference cache usage, and per-token latency over Qwen2.5-VL-3B. Right: Speed–performance trade-off among VLMs, where InfiniteVL achieves real-time streaming at 24 FPS with competitive performance at similar model scale. All inference results are measured on a single NVIDIA RTX 4090.
  • Figure 2: Architecture of InfiniteVL. Visual inputs (images, videos, real-time streams) are embedded by a native-resolution ViT and text by a tokenizer, then concatenated and processed by a stack of Hybrid Blocks. Each Hybrid Block combines an SWA module for local, linear-time modeling with three Gated DeltaNet layers that read from and write to a fixed-size memory cache to capture long-range dependencies, enabling context-length–agnostic inference with constant throughput and GPU memory.
  • Figure 3: Three-stage training strategy of InfiniteVL. The student model is initialized from a full-attention teacher, replacing its attention layers with Gated DeltaNet while inheriting all remaining parameters. Stage I performs layer-wise and end-to-end distillation to align Gated DeltaNet with the teacher. Stage II applies large-scale supervised fine-tuning on diverse multimodal instruction data to build strong instruction-following and reasoning abilities. Stage III conducts long-sequence SFT with additional high-resolution, document, and video QA/Caption data to enhance length generalization.
  • Figure 4: Length generalization and inference efficiency of InfiniteVL. (a–b) On Video-MME and LongVideoBench, InfiniteVL delivers stable performance as the number of input frames increases, whereas Qwen2.5-VL-3B (SWA) degrades once the context length exceeds its attention window. (c) InfiniteVL attains over $3.6\times$ lower per-token latency than a Transformer-based VLM of similar size. (d) InfiniteVL maintains real-time streaming inference at $\approx 24$ FPS with 274 tokens per frame, while Qwen2.5-VL-3B rapidly slows down and eventually runs out of memory.
  • Figure 5: L2 norm of the Linear-layer memory cache versus input frame index: the norm increases rapidly at the beginning and then stabilizes.
  • ...and 3 more figures
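
The Figure 2 caption describes Gated DeltaNet layers that read from and write to a fixed-size memory cache. A minimal sketch of that idea is given below, assuming the gated delta rule in its commonly published form, S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with output o_t = S_t q_t. The function names, the scalar gates `alpha` and `beta`, and the plain-list representation are illustrative assumptions. The authors' implementation (per-head gating, normalization, chunked GPU kernels) is not described in this section and will differ.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step on a fixed-size state S (d_v rows, d_k columns).

    alpha in [0, 1] decays the old memory; beta in [0, 1] sets how strongly
    the value stored under key k is overwritten by v.
    """
    d_k = len(k)
    new_S = []
    for i, row in enumerate(S):
        # What the memory currently returns for key k in output dimension i.
        pred = sum(row[j] * k[j] for j in range(d_k))
        # Erase the old association along k, decay, then write the new value.
        new_S.append([
            alpha * (row[j] - beta * pred * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ])
    # Read: project the updated state onto the query.
    out = [sum(r[j] * q[j] for j in range(d_k)) for r in new_S]
    return new_S, out


def run_stream(tokens, d_k, d_v):
    """Process (q, k, v, alpha, beta) tuples; the state never changes shape."""
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, alpha, beta in tokens:
        S, o = gated_delta_step(S, q, k, v, alpha, beta)
        outputs.append(o)
    return S, outputs


# Write [1, 2] under key e0, then overwrite it with [3, 4] under the same key.
e0 = [1.0, 0.0]
stream = [
    (e0, e0, [1.0, 2.0], 1.0, 1.0),
    (e0, e0, [3.0, 4.0], 1.0, 1.0),
]
final_state, outs = run_stream(stream, d_k=2, d_v=2)
print(outs)
```

The state stays a d_v × d_k matrix no matter how many tokens are processed, which is the property behind the constant-memory claim. With `beta = 1` and a unit-norm key, a second write to the same key replaces the earlier value rather than accumulating on top of it.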