LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen, Yidan Zhang, Yuan Yao, Zhiyuan Liu, Maosong Sun

TL;DR

This work compares global native-resolution encoding with slice-based encoding in multimodal LLMs and finds that global encoding yields stronger cross-modal understanding at a higher computational cost. It introduces Progressive Visual Compression (PVC), which combines Refined Patch Embedding and Windowed Token Compression to retrofit pretrained ViTs into efficient native-resolution encoders, yielding ViT-UHD and the downstream LLaVA-UHD v3. Within an identical MLLM architecture, ViT-UHD performs competitively with MoonViT while reducing TTFT by 2.4×, and LLaVA-UHD v3 performs competitively with Qwen2-VL across 15 benchmarks while reducing TTFT by 1.9×. The authors plan to release code and checkpoints to facilitate future work.

Abstract

Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered on our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates performance competitive with MoonViT while reducing TTFT (time-to-first-token) by 2.4×, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves performance competitive with Qwen2-VL, while further reducing TTFT by 1.9×. We will release all code and checkpoints to support future research on efficient MLLMs.
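To make the idea of windowed token compression concrete, here is a minimal sketch of what aggregating local tokens over non-overlapping windows can look like. It assumes plain mean pooling over 2×2 windows on a row-major token grid; the abstract does not specify the aggregation the paper's module actually uses, so the function name `window_compress`, the pooling choice, and the toy 4×4 grid are all illustrative assumptions.

```python
from typing import List, Tuple

Token = List[float]


def window_compress(
    tokens: List[Token], grid_h: int, grid_w: int, window: int = 2
) -> Tuple[List[Token], int, int]:
    """Average-pool non-overlapping window x window groups of tokens.

    `tokens` is a row-major list of grid_h * grid_w feature vectors.
    Assumption: mean pooling stands in for the paper's actual aggregation.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if grid_h % window or grid_w % window:
        raise ValueError("grid must be divisible by the window size")

    dim = len(tokens[0])
    out_h, out_w = grid_h // window, grid_w // window
    pooled: List[Token] = []
    for oy in range(out_h):
        for ox in range(out_w):
            acc = [0.0] * dim
            # Sum the window*window tokens that fall inside this window.
            for dy in range(window):
                for dx in range(window):
                    row = oy * window + dy
                    col = ox * window + dx
                    src = tokens[row * grid_w + col]
                    for k in range(dim):
                        acc[k] += src[k]
            pooled.append([v / (window * window) for v in acc])
    return pooled, out_h, out_w


# Toy input: a 4x4 grid of 1-dimensional tokens holding the values 0..15.
tokens = [[float(i)] for i in range(16)]

# Applying the compression twice mimics a progressive, multi-stage reduction:
# 16 tokens -> 4 tokens -> 1 token.
stage1, h1, w1 = window_compress(tokens, 4, 4)
stage2, h2, w2 = window_compress(stage1, h1, w1)
```

Each call divides the token count by the window area (4 here), so stacking calls at different depths reduces the sequence length step by step rather than all at once.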

Paper Structure

This paper contains 43 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: ViT-UHD and LLaVA-UHD v3 exhibit a superior trade-off between efficiency and performance. (a) Within the LLaVA training paradigm, ViT-UHD achieves higher average performance across 6 benchmarks, such as MMBench and AI2D, compared to state-of-the-art vision encoders, while maintaining substantially greater computational efficiency (e.g., achieving a 2.4× reduction in latency relative to MoonViT). (b) LLaVA-UHD v3 attains performance comparable to advanced MLLMs (e.g., Qwen2-VL) across 15 diverse benchmarks, while delivering 1.9× efficiency gains.
  • Figure 2: ShapeGrid and model performance. (a) Examples from the ShapeGrid benchmark, with each subset matched with color boxes. (b) Performance comparison between global native-resolution encoding (GNE) and slice-based encoding (SBE) across different general benchmarks and ShapeGrid subsets.
  • Figure 3: Illustration and analysis of the ShapeGrid-Sudoku subset. (a) Example from the Sudoku subset. (b) Accuracy heatmap of models with global native-resolution encoding (GNE) vs. slice-based encoding (SBE) on the Sudoku subset. (c) Attention score bias map showing the difference in attention activation between GNE and SBE.
  • Figure 4: Overview of the architecture of LLaVA-UHD v3. ViT-UHD first uses Refined Patch Embedding (RPE) to tokenize native-resolution images into fine-grained tokens. Windowed Token Compression (WTC) modules are inserted at multiple stages to progressively reduce token length while learning local semantics. The final vision tokens are then projected into the LLM.
  • Figure 5: Efficiency of varying compression position. Optimal trade-offs are shown in the light blue region. 4× denotes a different compression ratio.
  • ...and 3 more figures
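
The captions for Figures 4 and 5 refer to inserting compression at multiple stages and to the effect of compression position on efficiency. The following back-of-the-envelope sketch shows why position matters for compute. It assumes self-attention cost grows with the square of the token count and ignores every other cost; the 12-layer depth, the 4096 starting tokens, the layer positions, and both function names are hypothetical values chosen for illustration, not figures from the paper. The accuracy side of the trade-off is not modeled.

```python
from typing import Iterable, List


def tokens_per_layer(
    n_tokens: int, num_layers: int, compress_after: Iterable[int], ratio: int = 4
) -> List[int]:
    """Return the token count seen by each layer.

    `compress_after` holds 1-based layer indices; after each of those layers
    the token count is divided by `ratio` (hypothetical setup).
    """
    positions = set(compress_after)
    counts = []
    n = n_tokens
    for layer in range(1, num_layers + 1):
        counts.append(n)
        if layer in positions:
            n = max(1, n // ratio)
    return counts


def relative_attention_cost(counts: List[int]) -> float:
    """Sum of n^2 over layers, normalised by the cost with no compression."""
    baseline = len(counts) * counts[0] ** 2
    return sum(n * n for n in counts) / baseline


# Two hypothetical placements of two 4x compression stages in a 12-layer ViT.
early = tokens_per_layer(4096, 12, compress_after=[2, 4])
late = tokens_per_layer(4096, 12, compress_after=[4, 8])

early_cost = relative_attention_cost(early)
late_cost = relative_attention_cost(late)
print(f"early placement: {early_cost:.3f} of uncompressed attention cost")
print(f"late placement:  {late_cost:.3f} of uncompressed attention cost")
```

Under these assumptions, compressing earlier leaves fewer layers running at full sequence length and so lowers the attention cost, while compressing later keeps fine-grained tokens available for more layers. That tension is the trade-off the Figure 5 caption refers to.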