Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting

Qiyang Yu, Yu Fang, Tianrui Li, Xuemei Cao, Yan Chen, Jianghao Li, Fan Min

TL;DR

Grc-ViT presents a dynamic granularity Vision Transformer that routes images to coarse or fine patch configurations using a differentiable complexity estimator based on edge, entropy, and frequency cues. A granularity-adaptive shared Transformer backbone with lightweight adapters processes multi-scale tokens under a single attention core, achieving improved fine-grained discrimination with reduced computational cost. The model is trained with learnable thresholds $\alpha$ and $\beta$, enabling end-to-end granularity routing, and experiments demonstrate superior accuracy–efficiency trade-offs on both standard and fine-grained benchmarks. By integrating granular computing principles into ViT design, this approach offers a scalable, interpretable framework for balancing global reasoning and local detail in vision tasks.

Abstract

Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; and (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
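
The coarse evaluation stage described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the equal weighting of the three cues, the gradient threshold, the frequency cutoff, the three candidate patch sizes, and the default values of `alpha` and `beta` are all assumptions made for illustration. The sketch also uses hard thresholds, so unlike the estimator described in the paper it is not differentiable.

```python
import numpy as np

def edge_density(img, thresh=0.1):
    """Fraction of pixels whose gradient magnitude exceeds `thresh` (assumed value)."""
    gy, gx = np.gradient(img.astype(np.float64))
    return float((np.hypot(gx, gy) > thresh).mean())

def normalized_entropy(img, bins=256):
    """Shannon entropy of the intensity histogram, divided by log2(bins)."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum() / np.log2(bins))

def high_freq_ratio(img, cutoff=0.25):
    """Share of (mean-removed) spectral energy outside a low-frequency disc."""
    centered = img - img.mean()
    spec = np.abs(np.fft.fftshift(np.fft.fft2(centered))) ** 2
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot((yy - h // 2) / h, (xx - w // 2) / w)
    total = spec.sum()
    return float(spec[radius > cutoff].sum() / total) if total > 0 else 0.0

def complexity(img):
    """Equal-weight average of the three cues (the weighting is an assumption)."""
    return (edge_density(img) + normalized_entropy(img) + high_freq_ratio(img)) / 3.0

def route(score, alpha=0.3, beta=0.6, patch_sizes=(32, 16, 8)):
    """Hard routing: below alpha -> coarse, between alpha and beta -> medium, else fine."""
    if score < alpha:
        return patch_sizes[0]
    if score < beta:
        return patch_sizes[1]
    return patch_sizes[2]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = np.zeros((64, 64))        # uniform image: minimal complexity
    noisy = rng.random((64, 64))     # white noise in [0, 1): high complexity
    for name, img in (("flat", flat), ("noisy", noisy)):
        score = complexity(img)
        print(name, round(score, 3), "patch size:", route(score))
```

Images are assumed to be single-channel arrays with intensities in [0, 1]. To train the thresholds end-to-end, the hard comparisons in `route` would need a smooth relaxation, for example sigmoid gates around `alpha` and `beta`; that substitution is likewise an assumption rather than something stated in the abstract.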

Paper Structure

This paper contains 23 sections, 12 equations, 5 figures, 11 tables, 2 algorithms.

Figures (5)

  • Figure 1: Multi-sized windows in Grc-ViT, with fine-grained windows containing more fine-grained details (left) and coarse-grained windows designed to extract global features (right). Compared to Swin Transformer, which uses a single fixed-size window, Grc-ViT enables more combinations of perceptual regions as windows and patches are varied.
  • Figure 2: Effect of granularity on feature extraction. (a) shows the attention heatmap on the original image; (b), (c), and (d) show the attention heatmaps on the feature maps at different granularity levels.
  • Figure 3: Architecture of Grc-ViT. Grc-ViT optimizes feature extraction through a dual-granularity mechanism in which the coarse-grained stage guides a cascaded fine-grained stage, balancing efficiency and accuracy while verifying that the local information distribution and the selected granularity are appropriate. The coarse-grained stage filters key regions globally, quickly locates potential semantic units, and reduces computational redundancy. The fine-grained stage builds on the coarse-grained results, dynamically selecting among multiple granularity levels to progressively refine local features.
  • Figure 4: The process of fine-tuning parameters $\alpha$ and $\beta$.
  • Figure 5: (a) Comparative histogram of image entropy at different Patch_size values. (b) Line chart of pixel intensity distribution at different Patch_size values. (c) Radar chart of Fourier transform spectrum at different Patch_size values.

Theorems & Definitions (2)

  • Definition 1
  • Definition 2