CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

Srivathsan Sivakumar, Faisal Z. Qureshi

TL;DR

Cascaded-ViT (CViT) is a lightweight Vision Transformer that replaces standard FFNs with Cascaded-Chunk FFNs (CCFFN) to reduce parameters, FLOPs, and energy for edge deployment. CCFFN processes channel-wise chunks in a cascade and aggregates the progressively refined features, achieving competitive accuracy on ImageNet-1K while lowering hardware demands. A new Accuracy-Per-FLOP (APF) metric is proposed to capture compute efficiency relative to accuracy, with CViT models ranking highly across sizes and platforms, including mobile latency gains on iPhone hardware. The work demonstrates strong transferability to COCO tasks (object detection and segmentation) and provides thorough ablations showing the efficacy of the two-chunk, 2.5× expansion CCFFN design and the importance of cascading. Overall, CViT advances edge-friendly ViT design by balancing accuracy, energy, and memory efficiency, making it suitable for battery-constrained devices and real-time applications.
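
The parameter saving from chunking can be checked with simple arithmetic. The sketch below is not the authors' code: it assumes each chunk gets its own bias-free two-layer FFN, and the baseline expansion ratio of 2 is an illustrative assumption. Only the two-chunk, 2.5× configuration comes from the summary above, and all function names are hypothetical.

```python
def ffn_weights(dim: int, expansion: float) -> int:
    """Weights in a two-layer FFN (dim -> hidden -> dim), biases ignored."""
    hidden = int(dim * expansion)
    return dim * hidden + hidden * dim


def chunked_ffn_weights(dim: int, num_chunks: int, expansion: float) -> int:
    """Weights when the channels are split into equal chunks.

    Assumption: every chunk has its own independent FFN of width dim / num_chunks.
    """
    if dim % num_chunks != 0:
        raise ValueError("dim must be divisible by num_chunks")
    return num_chunks * ffn_weights(dim // num_chunks, expansion)


# Illustrative channel width; the baseline ratio of 2.0 is an assumption.
baseline = ffn_weights(128, 2.0)
# Two chunks with 2.5x expansion, as named in the summary.
chunked = chunked_ffn_weights(128, 2, 2.5)
ratio = chunked / baseline
```

Under these assumptions the chunked layout needs 62.5% of the baseline weights for a 128-channel layer. The cascade connections between chunks are not modeled here, since the summary does not specify them.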

Abstract

Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose Cascaded-ViT (CViT), a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called Cascaded-Chunk Feed Forward Network (CCFFN). By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called Accuracy-Per-FLOP (APF), which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2% more accurate than EfficientViT-M2 while having comparable APF scores.
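
The APF metric can be illustrated with a short sketch. The abstract does not give a formula; the form used here, Top-1 accuracy divided by log10 of the FLOP count in MFLOPs, is inferred from the Figure 1 caption and should be read as an assumption. The function name and the input numbers are hypothetical, not results from the paper.

```python
import math


def accuracy_per_flop(top1_percent: float, mflops: float) -> float:
    """Assumed APF: Top-1 accuracy (%) per log10 of the cost in MFLOPs."""
    if mflops <= 1.0:
        raise ValueError("mflops must exceed 1 so that log10(mflops) is positive")
    return top1_percent / math.log10(mflops)


# Hypothetical models with equal accuracy but different compute budgets.
apf_heavy = accuracy_per_flop(75.0, 1000.0)
apf_light = accuracy_per_flop(75.0, 100.0)
```

With accuracy held fixed, the cheaper model scores higher, which matches the caption's reading that higher APF means more accuracy per unit of computation.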

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (i) Accuracy per $\log_{10}$ MFLOP (APF) for all evaluated models. Models are grouped by size, and higher APF indicates greater Top-1 accuracy per unit of computation. (ii) Live memory trace of CViT-L and EfficientViT-M4. CViT-L consistently demands less memory, indicating system-wide memory efficiency. (iii) Reserved memory (MB) demands of CViT and EfficientViT. CViT demands less memory allocation across all scales except small.
  • Figure 2: An overview of EfficientViT (top). The proposed architecture replaces the EfficientViT Block with the Cascaded Chunk ViT Block (bottom). The Cascaded Chunk ViT Block replaces the pre- and post-attention FFNs in the EfficientViT Block with Cascaded Chunk FFNs (shown in yellow and highlighted with red arrows).
  • Figure 3: Cascaded-Chunk FFN layer that replaces the FFN layers in EfficientViT Blocks.
  • Figure 4: An overview of weight sharing. Specifically, the pre-attention FFN modules in successive Cascaded-ViT blocks share weights, and similarly, the post-attention FFN modules also share weights across blocks. Although Cascaded-ViT blocks are organized into different stages, the current implementation applies weight sharing across stage boundaries, i.e., between all successive blocks regardless of stage. When the total number of blocks is odd, the final block does not participate in weight sharing.
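
The weight-sharing scheme in the Figure 4 caption can be sketched as follows. This is one reading of the caption, not the authors' implementation: it assumes blocks are paired as (0, 1), (2, 3), and so on across stage boundaries, with a leftover final block keeping its own weights. `FFNParams` and `build_shared_ffns` are hypothetical placeholders.

```python
class FFNParams:
    """Hypothetical stand-in for the weights of one FFN module."""


def build_shared_ffns(num_blocks: int):
    """Return per-block pre- and post-attention FFN parameter objects.

    Assumption: each odd-indexed block reuses the FFNs of the block before it,
    so an odd block count leaves the final block unshared.
    """
    pre, post = [], []
    for i in range(num_blocks):
        if i % 2 == 1:
            # Second block of a pair: reuse the previous block's parameters.
            pre.append(pre[i - 1])
            post.append(post[i - 1])
        else:
            # First block of a pair (or the leftover final block): new parameters.
            pre.append(FFNParams())
            post.append(FFNParams())
    return pre, post


pre_ffns, post_ffns = build_shared_ffns(5)
unique_pre = len({id(p) for p in pre_ffns})
```

With five blocks, this pairing yields three distinct pre-attention parameter sets: two shared pairs plus the unshared final block. Pre- and post-attention FFNs are never shared with each other.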