GrootVL: Tree Topology is All You Need in State Space Model

Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

TL;DR

GrootVL first dynamically generates a tree topology from spatial relationships and input features, then propagates features along this tree, breaking the original sequence constraints to achieve stronger representation capabilities.

Abstract

State space models, which propagate features recursively, demonstrate strong representation capabilities comparable to Transformer models with superior efficiency. However, constrained by the inherent geometry of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Feature propagation is then performed on this tree, breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear-complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection, and segmentation. In addition, by fine-tuning large language models, our approach achieves consistent improvements on multiple textual tasks at minor training cost.
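
To make the linear-complexity claim concrete, here is a minimal sketch of a generic two-pass aggregation on a tree. It is an illustration under stated assumptions, not the paper's algorithm: the function name `tree_aggregate`, the `(u, v, w)` edge format, and the scalar per-edge weights `w` are all hypothetical stand-ins. For each vertex it computes the sum, over all vertices, of the input value scaled by the product of edge weights along the connecting path, visiting each edge a constant number of times.

```python
from collections import defaultdict

def tree_aggregate(tree_edges, x, root=0):
    """For every vertex i, return sum_j (product of edge weights on path i..j) * x[j].

    tree_edges: list of (u, v, w) tuples forming a tree (hypothetical format).
    x: list of scalar inputs, one per vertex.
    """
    adj = defaultdict(list)
    for u, v, w in tree_edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Breadth-first order from the root: every parent precedes its children.
    parent = {root: (None, 0.0)}
    order = [root]
    i = 0
    while i < len(order):
        u = order[i]
        for v, w in adj[u]:
            if v not in parent:
                parent[v] = (u, w)
                order.append(v)
        i += 1

    # Pass 1 (leaves to root): up[v] aggregates the subtree rooted at v.
    up = list(x)
    for v in reversed(order[1:]):
        p, w = parent[v]
        up[p] += w * up[v]

    # Pass 2 (root to leaves): add the contribution from outside the subtree.
    # out[p] - w * up[v] is everything the parent sees except v's own subtree.
    out = list(up)
    for v in order[1:]:
        p, w = parent[v]
        out[v] = up[v] + w * (out[p] - w * up[v])
    return out

# Path graph 0 - 1 - 2 with weight 0.5 on each edge and unit inputs.
result = tree_aggregate([(0, 1, 0.5), (1, 2, 0.5)], [1.0, 1.0, 1.0])
```

Both passes touch each vertex once, so the cost grows linearly with the number of vertices, whereas summing over all vertex pairs directly would be quadratic.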

Paper Structure (42 sections, 15 equations, 7 figures, 7 tables, 2 algorithms)

Figures (7)

  • Figure 1: Comparison of different propagation strategies for multi-modal tasks. For visual tasks, previous strategies (a) are based on fixed patterns, while our method (b) adaptively generates the propagation topology according to the input features. For textual tasks, compared to previous methods (c), our approach (d) breaks the inherent constraints of text sequences, facilitating the effective transmission of long-range information.
  • Figure 2: Illustration of the Tree State Space Model. Given an image feature map $x$, we perform the Tree Scanning Algorithm (TSA) to construct a $4$-connected graph with edge weights measured by the dissimilarity between pixels. Then, we obtain an MST with vertex set $\Omega$ through a pruning algorithm and perform the state transition for each vertex in this topology (detailed in \ref{sec:tree-scan}). Red arrows describe the propagation source of vertex $i$.
  • Figure 3: Overview of GrootV. LN means LayerNorm and FFN is a feed-forward network in the basic block. S2 and P1 denote stride of $2$ and padding size of $1$ in convolution, respectively.
  • Figure 4: Visualization of affinity maps at a specific position. The location is marked by the red cross in each input (a). TP is our tree topology scanning algorithm (b), which captures more detailed structural information and has a larger receptive field than raster scanning (c).
  • Figure 5: Semantic segmentation performance on the ADE20K val set. The crop size is set to $512^2$ throughout. SS and MS denote single-scale and multi-scale testing, respectively.
  • ...and 2 more figures
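
The graph construction described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `grid_edges` and `minimum_spanning_tree` are hypothetical, the Euclidean distance between feature vectors is an assumed choice of dissimilarity, and Kruskal's algorithm stands in for the pruning step, whose details are not given here.

```python
import numpy as np

def grid_edges(feat):
    """Return (weight, u, v) for every 4-connected neighbor pair of an H x W x C map.

    The weight is the Euclidean distance between the two feature vectors
    (an assumed dissimilarity measure). Vertices are numbered row by row.
    """
    h, w, _ = feat.shape
    edges = []
    for r in range(h):
        for c in range(w):
            u = r * w + c
            if c + 1 < w:  # right neighbor
                d = float(np.linalg.norm(feat[r, c] - feat[r, c + 1]))
                edges.append((d, u, u + 1))
            if r + 1 < h:  # bottom neighbor
                d = float(np.linalg.norm(feat[r, c] - feat[r + 1, c]))
                edges.append((d, u, u + w))
    return edges

def minimum_spanning_tree(num_vertices, edges):
    """Kruskal's algorithm with union-find; returns a list of (u, v, weight)."""
    parent = list(range(num_vertices))

    def find(a):
        # Follow parent links to the set representative, halving the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for weight, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so keep it
            parent[ru] = rv
            tree.append((u, v, weight))
    return tree

# A 2 x 2 map with one channel: left column is 0, right column is 1.
feat = np.array([[[0.0], [1.0]], [[0.0], [1.0]]])
tree = minimum_spanning_tree(4, grid_edges(feat))
```

On this toy input the tree keeps both zero-weight vertical edges and a single unit-weight horizontal edge, so similar pixels stay directly connected while dissimilar regions are linked only once.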