AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Yonghui Wang, Wengang Zhou, Hao Feng, Houqiang Li

TL;DR

AdaptVision is a multimodal large language model designed to process input images at varying resolutions dynamically. It mitigates the distortion that arises from resizing images to a uniform resolution and dynamically optimizes the visual tokens fed to the LLM.

Abstract

Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has attracted wide interest from researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. This method mitigates distortion effects that arise from resizing images to a uniform resolution and dynamically optimizes the visual tokens input to the LLMs. Our model is capable of processing images with resolutions up to $1008\times 1008$. Extensive experiments across various datasets demonstrate that our method achieves impressive performance in handling vision-language tasks in both natural and text-related scenes. The source code and dataset are now publicly available at https://github.com/harrytea/AdaptVision.
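
To make the partitioning idea in the abstract concrete, the following is a minimal sketch of how a grid could be chosen from an image's size and aspect ratio. It is not the authors' implementation. The cell side of 336 pixels is an assumption inferred from the stated $1008\times 1008$ maximum divided across three cells per side, and the ceiling-based selection rule and the function names `choose_grid` and `resized_size` are hypothetical.

```python
import math

CELL = 336      # ASSUMPTION: cell side in pixels (1008 / 3); not stated in the text
MAX_GRID = 3    # at most 3 cells per side, giving a maximum of 1008 x 1008


def choose_grid(width, height, cell=CELL, max_grid=MAX_GRID):
    """Return (cols, rows): the fewest cells that cover the image, capped at max_grid.

    Hypothetical rule for illustration only; the paper's actual rule may differ.
    """
    cols = min(max_grid, max(1, math.ceil(width / cell)))
    rows = min(max_grid, max(1, math.ceil(height / cell)))
    return cols, rows


def resized_size(width, height, cell=CELL, max_grid=MAX_GRID):
    """Return the (width, height) the image is resized to so it fills the chosen cells."""
    cols, rows = choose_grid(width, height, cell, max_grid)
    return cols * cell, rows * cell
```

Under these assumptions, a small natural image such as 300 by 200 pixels occupies a single cell, while a wide 2000 by 400 document image is capped at three columns and two rows, so its aspect ratio is roughly preserved instead of being squashed into a square.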

Paper Structure (16 sections, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Comparison of image processing with LLaVA liu2024visual and Monkey li2023monkey. Our method optimizes the number of patches: it processes natural images at low resolution and adaptively adjusts the input for high-resolution, text-dense images so that distorted text within images does not impair overall comprehension.
  • Figure 2: Overall architecture of AdaptVision. The process begins by splitting an image into two parts. The first is fed into the vision encoder, capturing the global information of the entire image. Meanwhile, the second undergoes adaptive segmentation via a dynamic image partition module, resulting in uniform patches that represent local features. Global features are directly projected into the word embedding space via a single linear layer. In contrast, local features undergo two layers of processing: the first aligns the dimensions with word embedding space, while the second performs dimensionality reduction, reducing the tokens to a quarter of their original count. In addition, learnable position tokens are prepended to both the global and local features to incorporate spatial context. Finally, all visual and text tokens are integrated into the LLM for further processing.
  • Figure 3: The principle of the dynamic image partitioning module. We predefine a $3\times 3$ grid, where each grid cell is of the size of the input image. The subfigure in the upper left corner represents the positional tokens defined for each cell. In the remaining five subfigures, we show some segmentation examples. The blue box outlines the image's original dimensions, while the green box indicates the image's resized dimensions, fitting the boundaries of the grid cells.
  • Figure 4: Visual comparison results of our AdaptVision method with LLaVA-1.5 liu2023improved and Monkey li2023monkey on TextVQA singh2019towards and OCRVQA mishra2019ocr datasets.
  • Figure 5: Visual comparison results of our AdaptVision method with LLaVA-1.5 liu2023improved and Monkey li2023monkey on STVQA biten2019icdar and POIE kuang2023visual datasets.
  • ...and 3 more figures
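
The Figure 2 caption describes how visual tokens are assembled: the global view keeps its tokens, each local patch is reduced to a quarter of its tokens, and position tokens are prepended to both. The sketch below turns that description into a rough token count. It is an illustration, not the authors' code: the default of 576 tokens per encoder view and one position token per view are assumptions, and `visual_token_count` is a hypothetical helper name.

```python
def visual_token_count(num_cells, tokens_per_view=576, pos_tokens_per_view=1):
    """Estimate the number of visual tokens passed to the LLM.

    ASSUMPTIONS (not from the source): tokens_per_view=576 and
    pos_tokens_per_view=1 are placeholder values chosen for illustration.
    """
    # Global view: projected by a single linear layer, so its token count is unchanged.
    global_tokens = tokens_per_view + pos_tokens_per_view
    # Each local patch: reduced to a quarter of its tokens, plus its position token(s).
    local_tokens = num_cells * (tokens_per_view // 4 + pos_tokens_per_view)
    return global_tokens + local_tokens


# One cell versus the full 3 x 3 grid, under the assumed defaults.
single_cell = visual_token_count(1)
full_grid = visual_token_count(9)
```

With these placeholder values, a single-cell image costs 722 tokens and a full $3\times 3$ grid costs 1882, which illustrates why the quarter reduction matters: without it, nine local patches alone would contribute 5184 feature tokens.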