EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification

Kazi Reyazul Hasan; Md Nafiu Rahman; Wasif Jalal; Sadif Ahmed; Shahriar Raj; Mubasshira Musarrat; Muhammad Abdullah Adnan

TL;DR

EVCC is a tri-branch fusion of ViT, ConvNeXt, and CoAtNet that targets high accuracy at reduced computational cost. It integrates adaptive token pruning, gated bidirectional cross-attention, and a confidence-aware routing mechanism to balance global and local cues dynamically and to cut unnecessary computation. Empirical results on CIFAR-100, Tobacco3482, CelebA, and Brain MRI show competitive or state-of-the-art accuracy with a 25–35% reduction in FLOPs, along with strong edge-device performance. The approach is practical for edge-enabled image classification and provides a flexible framework for future multi-branch fusion and multi-task learning.
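To make the idea of confidence-aware routing concrete, the following is a minimal sketch of one way to weight three branch predictions by their confidence. It is not taken from the EVCC implementation: the function names, the use of the maximum softmax probability as the confidence signal, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_fusion(branch_logits, temperature=1.0):
    """Fuse per-branch class logits with confidence-driven weights.

    branch_logits: one list of class logits per branch
                   (e.g. ViT, ConvNeXt, CoAtNet).
    Assumption: a branch's confidence is its maximum softmax probability.
    Returns (fused_probs, branch_weights).
    """
    probs = [softmax(logits) for logits in branch_logits]
    confidences = [max(p) for p in probs]
    # Hypothetical gate: softmax over confidences, sharpened by temperature.
    weights = softmax([c / temperature for c in confidences])
    n_classes = len(probs[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(n_classes)
    ]
    return fused, weights


# Usage: the first branch is the most confident, so it gets the largest weight.
fused, weights = confidence_weighted_fusion(
    [[2.0, 0.1, 0.1], [0.5, 0.4, 0.3], [0.2, 1.5, 0.1]]
)
```

Because the weights form a convex combination of valid probability vectors, the fused output is itself a probability distribution. A learned gate, as the TL;DR suggests, would replace the fixed max-probability heuristic used here.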

Abstract

Hybrid vision architectures that combine Transformers and CNNs have significantly advanced image classification, but usually at substantial computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture that integrates the Vision Transformer, a lightweight ConvNeXt, and CoAtNet through four key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware, confidence-driven weighting. Experiments on the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets show that EVCC outperforms strong baselines such as DeiT-Base, MaxViT-Base, and CrossViT-Base, consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points while reducing FLOPs by 25 to 35%. The adaptive architecture matches computational demand to deployment needs by dynamically reducing the token count, balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.
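As an illustration of what "adaptive token pruning with information preservation" could look like, the sketch below keeps the highest-scoring tokens and merges the pruned ones into a single summary token rather than discarding them. This is an assumed scheme, not the authors' algorithm: the importance scores, the `keep_ratio` parameter, and the score-weighted merge are all hypothetical choices.

```python
def prune_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the top-scoring tokens; merge the rest into one summary token.

    tokens: list of equal-length feature vectors (lists of floats).
    scores: one non-negative importance score per token
            (assumed to come from, e.g., attention weights).
    keep_ratio: fraction of tokens to keep (hypothetical parameter).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n_keep])  # preserve the original token order
    drop_idx = order[n_keep:]
    kept = [tokens[i] for i in keep_idx]
    if drop_idx:
        total = sum(scores[i] for i in drop_idx)
        if total > 0:
            w = [scores[i] / total for i in drop_idx]
        else:
            # All pruned scores are zero: fall back to a plain average.
            w = [1.0 / len(drop_idx)] * len(drop_idx)
        dim = len(tokens[0])
        # Score-weighted average of the pruned tokens keeps part of their signal.
        summary = [
            sum(wj * tokens[i][d] for wj, i in zip(w, drop_idx))
            for d in range(dim)
        ]
        kept.append(summary)
    return kept


# Usage: four tokens reduced to two kept tokens plus one summary token.
pruned = prune_tokens(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]],
    [0.4, 0.1, 0.3, 0.2],
    keep_ratio=0.5,
)
```

Since self-attention cost grows quadratically with the number of tokens, shrinking the token set in this way is one plausible route to the kind of FLOPs reduction the abstract reports.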
