EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification

Kazi Reyazul Hasan, Md Nafiu Rahman, Wasif Jalal, Sadif Ahmed, Shahriar Raj, Mubasshira Musarrat, Muhammad Abdullah Adnan

TL;DR

EVCC proposes a tri-branch fusion of ViT, ConvNeXt, and CoAtNet to achieve high accuracy with significant efficiency gains. By integrating adaptive token pruning, gated bidirectional cross-attention, and a confidence-aware routing mechanism, EVCC dynamically balances global and local cues and reduces unnecessary computation. Empirical results across CIFAR-100, Tobacco3482, CelebA, and Brain MRI demonstrate competitive or state-of-the-art accuracy with 25–35% FLOPs reduction, along with strong edge-device performance. The approach offers practical utility for edge-enabled image classification while providing a flexible framework for future multi-branch fusion and multi-task learning.

Abstract

Hybrid vision architectures combining Transformers and CNNs have substantially advanced image classification, but usually at significant computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet through four key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware confidence-driven weighting. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets show that EVCC outperforms strong models such as DeiT-Base, MaxViT-Base, and CrossViT-Base, consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points while reducing FLOPs by 25 to 35%. Our adaptive architecture adjusts computational demands to deployment needs by dynamically reducing token count, balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.
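
As a rough illustration of innovation (1), the sketch below keeps the top-$k$ tokens by an importance score and summarizes the discarded tokens in a single pooled vector, so their information is not lost entirely. This is not the authors' implementation: the function name `prune_tokens`, the externally supplied `scores`, and the use of mean pooling for the pruned tokens are all assumptions made for illustration.

```python
def prune_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens and mean-pool the rest.

    tokens: list of equal-length float vectors (one per token)
    scores: one importance score per token (ASSUMPTION: supplied by the caller;
            the source does not say how scores are computed)
    Returns (kept_tokens, z_pool).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and len(tokens)")

    # Rank token indices by descending score, then restore original order
    # so the kept tokens stay in their spatial sequence.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    drop = ranked[k:]

    kept = [tokens[i] for i in keep]
    dim = len(tokens[0])
    if drop:
        # ASSUMPTION: pruned tokens are summarized by their mean.
        z_pool = [sum(tokens[i][d] for i in drop) / len(drop) for d in range(dim)]
    else:
        z_pool = [0.0] * dim  # nothing was pruned
    return kept, z_pool


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    kept, z_pool = prune_tokens(toks, [0.9, 0.1, 0.8, 0.2], k=2)
    print(kept, z_pool)
```

Downstream layers would then attend over the `k` kept tokens plus the pooled summary, which is where the token-count reduction described above would come from.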

Paper Structure

This paper contains 19 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overall pipeline of the EVCC architecture
  • Figure 2: Input image $x$ is processed through two parallel branches. Branch A: Features from ViT ($E_{\text{ViT}}$) and ConvNeXt ($E_{\text{Conv}}$) are projected to embedding space ($Z_v$, $Z_c$), undergo adaptive token pruning where top-$k$ tokens are selected and pruned information is preserved in $Z_{\text{pool}}$, then pass through $L$ bidirectional cross-attention blocks where ViT2Conv attention ($\text{Attn}_{v \rightarrow c}$) and Conv2ViT attention ($\text{Attn}_{c \rightarrow v}$) are modulated by learnable gates ($G_v$, $G_c$). After global pooling, this yields $F_v$ and $F_c$. Branch B: CoAtNet processes features independently to produce $Z_x$.
  • Figure 3: GradCAM attention visualization demonstrating the effectiveness of our adaptive fusion strategy on CIFAR-100
  • Figure 4: The Router Gate concatenates all features $[F_v; F_c; Z_x]$, computes routing weights $\pi$ and confidence score $\text{conf}$, then produces the final representation $F_{\text{final}} = \sum_{i=0}^2 \pi_{\text{final}}[i] \cdot F_i'$ that feeds the main classifier and auxiliary classifiers. The losses computed by the classifiers are then merged into the final loss.
  • Figure 5: Document classification results on Tobacco-3482. EVCC correctly distinguishes visually similar document types in most cases by utilizing complementary features from different branches
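
The weighted sum in the Figure 4 caption, $F_{\text{final}} = \sum_{i=0}^2 \pi_{\text{final}}[i] \cdot F_i'$, can be sketched in a few lines. The captions do not say how $\pi$ and $\text{conf}$ are combined into $\pi_{\text{final}}$, so the blend toward uniform weights used below is purely a placeholder assumption, as are the function names `softmax` and `route` and the idea that the routing logits are passed in directly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(features, logits, conf):
    """Combine branch features with confidence-adjusted routing weights.

    features: list of equal-length vectors, one per branch (F_0', F_1', F_2')
    logits:   one routing logit per branch (ASSUMPTION: produced elsewhere,
              e.g. by a small network over the concatenated features)
    conf:     confidence score in [0, 1]
    Returns (pi_final, f_final).
    """
    if not 0.0 <= conf <= 1.0:
        raise ValueError("conf must lie in [0, 1]")
    n = len(features)
    pi = softmax(logits)

    # ASSUMPTION (not from the source): low confidence pulls the routing
    # weights toward a uniform mixture; high confidence trusts pi as is.
    pi_final = [conf * p + (1.0 - conf) / n for p in pi]

    # F_final = sum_i pi_final[i] * F_i'
    dim = len(features[0])
    f_final = [sum(pi_final[i] * features[i][d] for i in range(n)) for d in range(dim)]
    return pi_final, f_final


if __name__ == "__main__":
    branches = [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
    weights, fused = route(branches, logits=[2.0, 0.5, 0.1], conf=0.8)
    print(weights, fused)
```

Because both the softmax output and the uniform vector sum to one, `pi_final` remains a convex combination of the branch features under this placeholder rule.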