DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang

TL;DR

Large multimodal models typically attach a large sequence of visual tokens to LLM inputs, incurring high compute and memory costs. DeepStack rethinks integration by stacking visual tokens into multiple transformer layers, injecting high-resolution information progressively from bottom to top without changing the baseline architecture or context length. Empirical results show consistent improvements across nine benchmarks, with notable gains on high-resolution tasks and benefits when fine-tuning vision encoders, and similar gains when applied to ViT backbones. The approach offers a simple, scalable path to richer visual understanding in LMMs.

Abstract

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, because the input layer has to handle a large number of additional tokens. This paper presents DeepStack, a new architecture for LMMs. Considering $N$ layers in the language and vision transformers of LMMs, we stack the visual tokens into $N$ groups and feed each group to its aligned transformer layer from bottom to top. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers at minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs and validate its effectiveness with extensive empirical results. Using the same context length, our DeepStack models with 7B and 13B parameters surpass their counterparts by 2.7 and 2.9 on average across 9 benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals the counterparts that use the full context length. These gains are particularly pronounced on high-resolution tasks, e.g., improvements of 4.2, 11.0, and 4.0 on TextVQA, DocVQA, and InfoVQA, respectively, compared to LLaVA-1.5-7B. We further apply DeepStack to vision transformer layers, which brings a similar improvement of 3.8 on average compared with LLaVA-1.5-7B.
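
The layer-wise stacking described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `deepstack_forward`, the parameters `start` and `interval`, and the identity "layers" in the demo are all hypothetical. The sketch only shows the general idea of adding one group of extra visual tokens to the visual positions of the hidden states before selected layers, so the sequence length never grows.

```python
import numpy as np

def deepstack_forward(layers, hidden, visual_slice, stacks, start=1, interval=1):
    """Run `layers` over `hidden` of shape (seq_len, dim).

    Before layer `start`, `start + interval`, ... one entry of `stacks`
    (each of shape (num_visual, dim)) is added to the visual positions.
    All names here are illustrative assumptions, not the paper's API.
    """
    for i, layer in enumerate(layers):
        k, r = divmod(i - start, interval)
        if i >= start and r == 0 and k < len(stacks):
            hidden = hidden.copy()              # avoid mutating the caller's array
            hidden[visual_slice] += stacks[k]   # residual injection, same sequence length
        hidden = layer(hidden)
    return hidden

# Toy demo: 6 positions, 3 of them visual, identity functions standing in for layers.
seq_len, dim = 6, 4
visual_slice = slice(1, 4)
hidden0 = np.zeros((seq_len, dim))
stacks = [np.ones((3, dim)), np.ones((3, dim))]
layers = [lambda h: h] * 4

out = deepstack_forward(layers, hidden0, visual_slice, stacks)
```

In the toy demo the sequence length stays at 6 while two extra groups of visual tokens are absorbed, which mirrors the claim that more visual tokens are handled without a longer context.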

Paper Structure (15 sections, 8 equations, 7 figures, 11 tables)


Figures (7)

  • Figure 1: Left: Conventional large multimodal models (LMMs) string all visual tokens into a sequence for high- and low-resolution images. Middle: Our DeepStack LMMs stack the tokens into a grid and infuse them into the first and middle transformer layers from bottom to top (■$\uparrow$■$\uparrow$■$\uparrow$) simply using a residual connection. Without modifying the architecture or increasing the context length, our model can handle several times more visual tokens as input. Right: We apply DeepStack separately to Vicuna-7B (DeepStack-L) and CLIP ViT-L (DeepStack-V). Our models can take 4$\times$ more visual tokens, significantly outperform the sequence LMM with the same context length, and rival the one using a much longer context over a wide range of benchmarks.
  • Figure 2: Architecture of DeepStack. The main innovation lies in the DeepStack strategy that infuses visual tokens into different layers. Left: DeepStack for LLMs. Given an input image, we feed the tokens extracted from the low-resolution version to the input layer of the LLM. Considering the 2D nature of images, we extract the neighbors from the high-resolution version and reorganize them into DeepStack, which is then fed to the subsequent layers of the LLM. Right: DeepStack for ViTs. We apply a similar sampling strategy but feed the visual tokens into the ViT layers of the vision encoder.
  • Figure 3: Analysis of using LLM layers to process visual tokens. (a) We insert the visual tokens at different starting layers and initialize the corresponding input embeddings to zero; (b) We fix the first layer as the insertion point for global visual tokens and ablate the interval $s$ for stacking high-resolution tokens; (c) We ablate the number of layers used for token stacking.
  • Figure 4: Visualization. Both LLaVA-1.5 and DeepStack use a visual context length of 576 for a fair comparison. Top: We mark the area corresponding to each question with a red circle. DeepStack answers questions that require high-resolution, fine-grained understanding well. Bottom: DeepStack demonstrates more accurate visual understanding in detailed visual captioning.
  • Figure 5: Visualization of three sampling methods for DeepStack.
  • ...and 2 more figures
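
The neighbor sampling mentioned in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the helper `split_into_stacks`, the strided sampling pattern, the 24×24 low-resolution grid (chosen so that 24 × 24 = 576, the visual context length quoted in Figure 4), and the toy feature dimension are all hypothetical. Figure 5 mentions three sampling methods; only one plausible variant is shown here.

```python
import numpy as np

def split_into_stacks(high_res, factor=2):
    """Split a (H*factor, W*factor, D) token grid into factor**2 groups.

    Group (dy, dx) takes every `factor`-th token starting at offset (dy, dx),
    so the token at index (y, x) of each group is a spatial neighbor of the
    low-resolution token at (y, x). Each group is flattened to (H*W, D).
    """
    groups = [high_res[dy::factor, dx::factor]
              for dy in range(factor) for dx in range(factor)]
    return [g.reshape(-1, g.shape[-1]) for g in groups]

# Assumed sizes: 24x24 low-res grid, 48x48 high-res grid, toy feature dim 8.
low_hw, factor, dim = 24, 2, 8
high_res = np.arange((low_hw * factor) ** 2 * dim, dtype=np.float32)
high_res = high_res.reshape(low_hw * factor, low_hw * factor, dim)

stacks = split_into_stacks(high_res, factor)
context_len = low_hw * low_hw                        # tokens in the LLM input sequence
total_tokens = context_len + sum(len(s) for s in stacks)
```

Under these assumed sizes, the high-resolution grid contributes 4 × 576 = 2304 extra tokens, matching the "4× more visual tokens" in Figure 1, while the context length stays at 576 out of 2880 total visual tokens.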