LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer

Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Xiaoying Zhang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

TL;DR

This work addresses the weakness of CLIP-ViT-based multimodal LLMs at capturing fine-grained visual details. It introduces the Hiwin transformer, which consists of a visual detail injection module that builds an inverse semantic pyramid and a hierarchical window attention mechanism that compresses multi-scale semantics into spatially consistent tokens. The approach improves results by an average of 3.7% across 14 benchmarks, including 9.3% on DocVQA, while maintaining data efficiency and reasonable training cost. The method also generalizes to different LLM backbones and offers a general framework for integrating high-resolution visual representations into MLLMs.

Abstract

Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they exhibit inferior performance on tasks regarding fine-grained visual perception. We attribute this to the limitations of ViTs in capturing diverse multi-modal visual levels, such as low-level details. To address this issue, we present LLaVA-UHD v2, an MLLM that gains advanced perception abilities from a well-designed vision-language projector, the Hierarchical window (Hiwin) transformer. The Hiwin transformer enhances the MLLM's ability to capture diverse multi-modal visual granularities by incorporating our constructed high-resolution semantic pyramid. Specifically, the Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level language-aligned semantic features, thereby forming an inverse semantic pyramid (ISP), and (ii) a hierarchical window attention module, which leverages cross-scale windows to condense multi-level semantics from the ISP. Extensive experiments show that LLaVA-UHD v2 outperforms compared MLLMs on a wide range of benchmarks. Notably, our design achieves an average boost of 3.7% across 14 benchmarks compared with the baseline method, including 9.3% on DocVQA. All the data and code will be publicly available to facilitate future research.
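
The pyramid construction described in the abstract can be illustrated with a small sketch. It is not the authors' module: the learned visual detail injection is replaced here by a fixed rule (nearest-neighbour upsampling plus a block-mean high-pass residual of the image), features are single-channel square grids, and the names `upsample2x`, `avg_pool`, `high_freq`, `build_isp`, and the weight `alpha` are all hypothetical.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a square 2D list."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def avg_pool(grid, k):
    """Average-pool a 2D list with a k x k window and stride k."""
    h, w = len(grid), len(grid[0])
    return [[sum(grid[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w, k)]
            for i in range(0, h, k)]


def high_freq(image, k):
    """Residual of the image against its k x k block mean (a crude high-pass)."""
    coarse = avg_pool(image, k)
    return [[image[i][j] - coarse[i // k][j // k] for j in range(len(image[0]))]
            for i in range(len(image))]


def build_isp(f0, image, levels=2, alpha=0.5):
    """Return [F0, F1, ..., F_levels]; each level doubles the resolution.

    Assumes square inputs with len(image) == len(f0) * 2**levels.
    The additive injection rule is an assumption, not the paper's learned module.
    """
    pyramid = [f0]
    for _ in range(levels):
        up = upsample2x(pyramid[-1])
        size = len(up)
        # Bring the image to this level's resolution before extracting detail.
        img_l = avg_pool(image, len(image) // size)
        detail = high_freq(img_l, 2)
        pyramid.append([[up[i][j] + alpha * detail[i][j] for j in range(size)]
                        for i in range(size)])
    return pyramid


# Toy usage: a 2x2 "semantic" map and an 8x8 "image".
f0 = [[1.0, 2.0], [3.0, 4.0]]
image = [[float((i * 8 + j) % 7) for j in range(8)] for i in range(8)]
pyramid = build_isp(f0, image)
```

Because the injected residual has zero mean over every 2x2 block, average-pooling any level in this sketch recovers the level below it, so the coarse semantics are left intact while finer levels gain image detail.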

Paper Structure

This paper contains 27 sections, 9 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Comparison of LLaVA-UHD v2 with other MLLMs. (a) MLLMs typically align ViT features to language space using MLPs [liu2023llava1.5] or perceiver re-samplers [Alayrac2023Flamingo, li2023blip2], lacking visual granularity. (b) Combining multiple visual encoders is non-universal and computationally intensive. (c) LLaVA-UHD v2 employs the Hiwin transformer to build an inverse semantic pyramid and compress it into visual tokens, providing various semantic granularity for language generation.
  • Figure 2: The overall architecture of the proposed LLaVA-UHD v2, consisting of a ViT, our hierarchical window transformer (Hiwin transformer), and an LLM. The Hiwin transformer first injects high-frequency visual details from the image into the high-level semantics of ViT features, forming inverse semantic pyramids (ISPs). It then compresses the ISPs into spatially consistent tokens via cross-scale windows, for better vision-language alignment. Details of the two procedures are illustrated in Figures 3 and 4.
  • Figure 3: The flowchart illustrates the construction of the Inverse Semantic Pyramid (ISP). As the first level of the ISP, $\mathcal{F}^0$ is the high-level language-aligned semantic features from CLIP-ViT. Subsequent levels, $\mathcal{F}^1$ and $\mathcal{F}^2$, are progressively built by injecting high-frequency visual details from the input image into upsampled features from the previous level, via the Visual Detail Injection Module (VDIM). A Multi-level Reconstruction (MLR) loss supervises each scale, ensuring both text-aligned semantic coherence and fine-grained visual fidelity.
  • Figure 4: The flowchart of hierarchical window attention. We initialize a set of learnable queries to attend to local regions. Feature maps from the ISP are processed by a set of cross-scale windows, forming hierarchical and local-aware features at different levels. The features are then concatenated along the length axis, to serve as the key and value for the learnable queries. The output is condensed visual tokens rich in diverse and local-aware semantics.
  • Figure 5: Comparison of performance. (a) Performance of using different projectors on compressing ISP. Hiwin attention exhibits a significant advantage. (b) Performance of our model equipped with different LLMs.
  • ...and 8 more figures
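
The hierarchical window attention summarized in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each query on a q x q grid gathers the window it covers at every pyramid level, the window features are concatenated along the length axis, and a single-head dot-product attention (keys doubling as values, no learned projections) condenses them into one token. The function names `window`, `attend`, and `hiwin_attention` are hypothetical.

```python
import math


def window(fmap, qi, qj, q):
    """Feature vectors in the window of this level that covers query cell (qi, qj)."""
    step = len(fmap) // q  # window side length at this level
    return [fmap[i][j]
            for i in range(qi * step, (qi + 1) * step)
            for j in range(qj * step, (qj + 1) * step)]


def attend(query, keys):
    """Single-head scaled dot-product attention; keys are reused as values."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * k[c] for w, k in zip(weights, keys)) / total
            for c in range(len(query))]


def hiwin_attention(pyramid, queries):
    """Condense a pyramid of square feature maps into a q x q grid of tokens.

    pyramid: list of maps, each map[i][j] a feature vector; sides divisible by q.
    queries: q x q grid of query vectors (learnable in a real model, fixed here).
    """
    q = len(queries)
    tokens = []
    for qi in range(q):
        row = []
        for qj in range(q):
            kv = []
            for fmap in pyramid:  # concatenate windows from all levels
                kv.extend(window(fmap, qi, qj, q))
            row.append(attend(queries[qi][qj], kv))
        tokens.append(row)
    return tokens


# Toy usage: every feature stores the index of the quadrant it lies in.
q = 2
pyramid = [[[[float(i * q // side), float(j * q // side)] for j in range(side)]
            for i in range(side)]
           for side in (2, 4, 8)]
queries = [[[0.3, -0.1] for _ in range(q)] for _ in range(q)]
tokens = hiwin_attention(pyramid, queries)
```

In the toy input each query sees 1 + 4 + 16 = 21 keys, all taken from its own quadrant, so each output token reproduces that quadrant's index. This is the sense in which the condensed tokens stay spatially consistent across scales.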