Table of Contents
Fetching ...

VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models

Zihang Li, Haowen Hou

TL;DR

This paper presents VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs, and developed a lossless downsampling method to effectively integrate a high-resolution vision encoder with low-resolution encoders, without extending the input sequence length.

Abstract

Accurately understanding complex visual information is crucial for visual language models (VLMs). Enhancing image resolution can improve visual perception capabilities, not only reducing hallucinations but also boosting performance in tasks that demand high resolution, such as text-rich or document analysis. In this paper, we present VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs. For VisualRWKV-HD, we developed a lossless downsampling method to effectively integrate a high-resolution vision encoder with low-resolution encoders, without extending the input sequence length. For the VisualRWKV-UHD model, we enhanced image representation by dividing the image into four segments, which are then recombined with the original image. This technique allows the model to incorporate both high-resolution and low-resolution features, effectively balancing coarse and fine-grained information. As a result, the model supports resolutions up to 4096 x 4096 pixels, offering a more detailed and comprehensive visual processing capability. Both VisualRWKV-HD and VisualRWKV-UHD not only achieve strong results on VLM benchmarks but also show marked improvements in performance for text-rich tasks.

VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models

TL;DR

This paper presents VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs, and developed a lossless downsampling method to effectively integrate a high-resolution vision encoder with low-resolution encoders, without extending the input sequence length.

Abstract

Accurately understanding complex visual information is crucial for visual language models (VLMs). Enhancing image resolution can improve visual perception capabilities, not only reducing hallucinations but also boosting performance in tasks that demand high resolution, such as text-rich or document analysis. In this paper, we present VisualRWKV-HD and VisualRWKV-UHD, two advancements in the VisualRWKV model family, specifically designed to process high-resolution visual inputs. For VisualRWKV-HD, we developed a lossless downsampling method to effectively integrate a high-resolution vision encoder with low-resolution encoders, without extending the input sequence length. For the VisualRWKV-UHD model, we enhanced image representation by dividing the image into four segments, which are then recombined with the original image. This technique allows the model to incorporate both high-resolution and low-resolution features, effectively balancing coarse and fine-grained information. As a result, the model supports resolutions up to 4096 x 4096 pixels, offering a more detailed and comprehensive visual processing capability. Both VisualRWKV-HD and VisualRWKV-UHD not only achieve strong results on VLM benchmarks but also show marked improvements in performance for text-rich tasks.

Paper Structure

This paper contains 28 sections, 2 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: VisualRWKV-HD and UHD Overview. The input image is processed through three vision encoders and a high-resolution strategy, followed by a multi-layer perceptron (MLP) with context gating to generate image features.
  • Figure 2: UHD-strategy: We divide the input image into four sections, which are then processed by the SigLip, DINOv2, and SAM encoders. The resulting features from each section are merged and passed through avgpool2d. These pooled features are concatenated with the HD-Feature generated from previous steps, ultimately producing the final UHD-Feature.
  • Figure 3: Distribution of the HD559k dataset, showcasing the various datasets and their respective quantities. This comprehensive dataset includes a diverse range of sources, contributing to a total of 559,494 images utilized for training and evaluation purposes.
  • Figure 4: Distribution of the HD667k dataset, illustrating the composition and quantity of various datasets included. With a total of 667,000 images, this dataset encompasses a wide array of visual tasks and sources, aimed at enhancing the training and evaluation of model performance.