CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction

Chunlei Meng, Jiacheng Yang, Wei Lin, Bowen Liu, Hongda Zhang, Chun Ouyang, Zhongxue Gan

TL;DR

The CNN-Transformer Aggregation Network (CTA-Net) combines CNNs and ViTs: transformers capture long-range dependencies while CNNs extract localized features, enabling efficient processing of both detailed local and broader contextual information.

Abstract

Convolutional neural networks (CNNs) and vision transformers (ViTs) have become essential in computer vision for local and global feature extraction, respectively. However, existing methods that aggregate these architectures are often inefficient. To address this, the CNN-Transformer Aggregation Network (CTA-Net) was developed. CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features. This integration enables efficient processing of both detailed local and broader contextual information. CTA-Net introduces the Light Weight Multi-Scale Feature Fusion Multi-Head Self-Attention (LMF-MHSA) module for effective multi-scale feature integration with reduced parameters. Additionally, the Reverse Reconstruction CNN-Variants (RRCV) module enhances the embedding of CNNs within the transformer architecture. Extensive experiments on small-scale datasets (fewer than 100,000 samples) show that CTA-Net achieves superior top-1 accuracy (86.76%), fewer parameters (20.32M), and lower computational cost (2.83B FLOPs), making it an efficient and lightweight solution for visual tasks on such datasets.

Paper Structure

This paper contains 26 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Top: ViT uses pure Transformer blocks. Middle: State-of-the-art ViT-Variants use a CNN branch and a Transformer branch. Bottom: CTA-Net uses a CNN-Transformer Aggregation Network that aggregates the CNN into the Transformer to exploit the advantages of both.
  • Figure 2: (a) illustrates the overall architecture of CTA-Net, highlighting the central CNN-Transformer (CT) Block, which integrates CNNs with transformers for enhanced feature extraction. (b1) depicts the LMF-MHSA (Lightweight Multi-Scale Feature Fusion Multi-Head Self-Attention) module, which efficiently learns multi-scale features while reducing computational complexity. (b2) provides a detailed view of the Multi-Scale Conv operation, showing how different convolution kernel sizes are used to extract multi-scale features from the input. (c1) illustrates the RRCV (Reverse Reconstruction CNN-Variants) module, designed to embed CNN operations within the Transformer architecture and so leverage the strengths of both CNNs and transformers. (c2) offers a detailed view of the Reconstruction operation, showing how local features extracted by CNNs are integrated into the Transformer's global context.
  • Figure 3: Improvement of CTA-Net over CNN-Variants, ViT-Variants, and ViT-Aggregation models. Circles of different colors represent different models. The closer a circle is to the lower-left corner, the fewer parameters the model has and the more efficient it is. The red circle representing CTA-Net is closest to the lower-left corner, indicating that it is the lightest and most efficient of the compared models.
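
The caption for Figure 2 (b2) says that the Multi-Scale Conv operation uses different convolution kernel sizes to extract multi-scale features from the input. The sketch below illustrates that general idea only and is not the paper's implementation. The kernel sizes (1, 3, 5), the fixed mean (box) filters used in place of learned weights, the averaging fusion, and the names `box_conv2d` and `multi_scale_conv` are all assumptions made for illustration.

```python
# Illustrative sketch: parallel filters with different kernel sizes, then fusion.
# ASSUMPTIONS (not from the paper): kernel sizes 1/3/5, fixed mean filters
# instead of learned convolution weights, and fusion by simple averaging.

def box_conv2d(image, k):
    """Apply a k x k mean filter with zero padding; output keeps the input size."""
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    # Positions outside the image count as zeros (zero padding).
                    if 0 <= y < h and 0 <= x < w:
                        total += image[y][x]
            out[i][j] = total / (k * k)
    return out


def multi_scale_conv(image, kernel_sizes=(1, 3, 5)):
    """Run one branch per kernel size and fuse the branches by averaging."""
    branches = [box_conv2d(image, k) for k in kernel_sizes]
    h, w = len(image), len(image[0])
    fused = [
        [sum(b[i][j] for b in branches) / len(branches) for j in range(w)]
        for i in range(h)
    ]
    return branches, fused


if __name__ == "__main__":
    # A 5x5 input with a single bright pixel in the centre.
    image = [[0.0] * 5 for _ in range(5)]
    image[2][2] = 1.0
    branches, fused = multi_scale_conv(image)
    for k, branch in zip((1, 3, 5), branches):
        print(f"kernel {k}: centre response = {branch[2][2]:.4f}")
    print(f"fused centre response = {fused[2][2]:.4f}")
```

Small kernels keep the response concentrated at the bright pixel, while larger kernels spread it over a wider neighbourhood. Fusing the branches gives each output position a mix of fine and coarse context, which is the intuition behind multi-scale feature extraction.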