CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

Srivathsan Sivakumar; Faisal Z. Qureshi

TL;DR

This paper introduces Cascaded-ViT (CViT), a lightweight Vision Transformer that replaces standard FFNs with Cascaded-Chunk FFNs (CCFFN) to reduce parameters, FLOPs, and energy use for edge deployment. CCFFN splits features into channel-wise chunks, processes the chunks in a cascade, and aggregates the progressively refined features, achieving competitive accuracy on ImageNet-1K while lowering hardware demands. A new Accuracy-Per-FLOP (APF) metric is proposed to capture compute efficiency relative to accuracy, and CViT models rank highly on it across sizes and platforms, including mobile latency gains on iPhone hardware. The work also demonstrates strong transferability to COCO tasks (object detection and segmentation) and provides thorough ablations showing the efficacy of the two-chunk, 2.5× expansion CCFFN design and the importance of cascading. Overall, CViT advances edge-friendly ViT design by balancing accuracy, energy, and memory efficiency, making it suitable for battery-constrained devices and real-time applications.
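The chunked, cascaded feed-forward idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the cascade works by adding each chunk's FFN output to the next chunk's input before that chunk is processed, and that the per-chunk outputs are concatenated; the summary does not spell out either detail. The names `make_ccffn_params`, `ccffn`, and `param_count` are hypothetical, biases and normalization are omitted, and ReLU is an assumed activation. Only the two-chunk split and the 2.5× expansion ratio come from the text.

```python
import numpy as np


def make_ccffn_params(dim, num_chunks=2, expansion=2.5, seed=0):
    """Create one small two-layer FFN per channel chunk (hypothetical helper)."""
    if dim % num_chunks != 0:
        raise ValueError("dim must be divisible by num_chunks")
    rng = np.random.default_rng(seed)
    chunk = dim // num_chunks
    hidden = int(chunk * expansion)  # 2.5x expansion applied per chunk
    return [
        (rng.standard_normal((chunk, hidden)) * 0.02,   # expand
         rng.standard_normal((hidden, chunk)) * 0.02)   # project back
        for _ in range(num_chunks)
    ]


def ccffn(x, params):
    """Apply the per-chunk FFNs in sequence over the channel axis.

    Assumption: the previous chunk's output is added to the next chunk's
    input, so later chunks see progressively refined features.
    """
    chunks = np.split(x, len(params), axis=-1)
    outputs = []
    carry = None
    for piece, (w_in, w_out) in zip(chunks, params):
        if carry is not None:
            piece = piece + carry                 # cascade step (assumed form)
        hidden = np.maximum(piece @ w_in, 0.0)    # ReLU (assumed activation)
        carry = hidden @ w_out
        outputs.append(carry)
    return np.concatenate(outputs, axis=-1)       # aggregate chunk outputs


def param_count(params):
    """Total number of weights across all chunk FFNs."""
    return sum(w_in.size + w_out.size for w_in, w_out in params)


if __name__ == "__main__":
    params = make_ccffn_params(dim=64)
    tokens = np.ones((4, 64))          # 4 tokens, 64 channels (toy sizes)
    print(ccffn(tokens, params).shape)
    print(param_count(params))
```

Because each chunk FFN operates on only a fraction of the channels, its weight matrices are correspondingly smaller than those of a single FFN over the full width, which is the intuition behind the parameter and FLOP savings described above.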

Abstract

Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose Cascaded-ViT (CViT), a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called Cascaded-Chunk Feed Forward Network (CCFFN). By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our CViT-XL model achieves 75.5% Top-1 accuracy while reducing FLOPs by 15% and energy consumption by 3.3% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called Accuracy-Per-FLOP (APF), which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. Particularly, CViT-L is 2.2% more accurate than EfficientViT-M2 while having comparable APF scores.
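To make the idea of an accuracy-per-compute score concrete, the sketch below ranks models by a plain ratio of Top-1 accuracy to FLOPs. The abstract does not give the APF formula, so this ratio is an assumption and the paper's definition may normalize differently. The function names `accuracy_per_flop` and `rank_by_apf` are hypothetical, and the model names and numbers in the usage section are made-up placeholders, not results from the paper.

```python
def accuracy_per_flop(top1_percent, mflops):
    """Assumed form of an accuracy-per-compute score: Top-1 (%) per MFLOP."""
    if mflops <= 0:
        raise ValueError("mflops must be positive")
    return top1_percent / mflops


def rank_by_apf(models):
    """Sort (name, top1_percent, mflops) tuples from most to least efficient."""
    return sorted(
        models,
        key=lambda m: accuracy_per_flop(m[1], m[2]),
        reverse=True,
    )


if __name__ == "__main__":
    # Placeholder entries for illustration only; not measured values.
    candidates = [
        ("model_a", 70.0, 200.0),
        ("model_b", 75.0, 300.0),
    ]
    for name, top1, mflops in rank_by_apf(candidates):
        print(name, accuracy_per_flop(top1, mflops))
```

Under this assumed ratio, a more accurate model can still rank lower when its extra accuracy costs disproportionately more compute. This is why the metric is reported next to raw accuracy and not as a replacement for it.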
