InceptionNeXt: When Inception Meets ConvNeXt

Weihao Yu, Pan Zhou, Shuicheng Yan, Xinchao Wang

TL;DR

This work addresses the efficiency bottlenecks of large-kernel CNNs by decomposing large depthwise convolutions into an Inception-style multi-branch scheme, enabling faster training and inference without sacrificing accuracy. It introduces MetaNeXt as a lightweight block framework and instantiates InceptionNeXt with four-branch Inception depthwise convolutions, yielding substantial speedups over ConvNeXt while maintaining competitive accuracy. Across ImageNet-1K and ADE20K, InceptionNeXt demonstrates superior speed–accuracy trade-offs, especially in smaller models, and proves robust across classification and dense-prediction tasks. The method offers a practical CNN baseline for resource-conscious vision applications and emphasizes design choices that reduce computational and memory bottlenecks.

Abstract

Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have recently been widely studied and adopted to enlarge the receptive field and improve model performance; a notable example is ConvNeXt, which employs 7x7 depthwise convolution. Although such a depthwise operator consumes only a few FLOPs, it substantially harms model efficiency on powerful computing devices because of its high memory access costs. For example, ConvNeXt-T has FLOPs similar to ResNet-50 but achieves only ~60% of its throughput when trained on A100 GPUs with full precision. Reducing the kernel size of ConvNeXt improves speed but causes significant performance degradation, which poses a challenging problem: how to speed up large-kernel CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping. With this new Inception depthwise convolution, we build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance. For instance, InceptionNeXt-T achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2% top-1 accuracy improvement on ImageNet-1K. We anticipate that InceptionNeXt can serve as an economical baseline for future architecture design to reduce carbon footprint. Code is available at https://github.com/sail-sg/inceptionnext.
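
The four-branch decomposition described in the abstract can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the authors' reference implementation (see the repository linked above for that). The class name, the 3x3 square kernel, the band length of 11, and the 1/8 share of channels per convolutional branch are assumptions chosen for the example; the abstract only specifies the four branch types.

```python
import torch
import torch.nn as nn


class InceptionDWConv2d(nn.Module):
    """Sketch of a four-branch Inception depthwise convolution.

    Assumed defaults (not stated in the abstract): 3x3 square kernel,
    band kernels of length 11, and 1/8 of the channels per conv branch.
    Requires channels * branch_ratio >= 1 so each branch has a channel.
    """

    def __init__(self, channels, square_kernel=3, band_kernel=11, branch_ratio=0.125):
        super().__init__()
        g = int(channels * branch_ratio)  # channels per convolutional branch
        # Small square depthwise kernel (groups == channels makes it depthwise).
        self.square = nn.Conv2d(g, g, square_kernel,
                                padding=square_kernel // 2, groups=g)
        # Two orthogonal band kernels: 1 x k (horizontal) and k x 1 (vertical).
        self.band_w = nn.Conv2d(g, g, (1, band_kernel),
                                padding=(0, band_kernel // 2), groups=g)
        self.band_h = nn.Conv2d(g, g, (band_kernel, 1),
                                padding=(band_kernel // 2, 0), groups=g)
        # Remaining channels pass through untouched (identity branch).
        self.split_sizes = [channels - 3 * g, g, g, g]

    def forward(self, x):
        # Split along the channel dimension, process each group, re-concatenate.
        x_id, x_sq, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat(
            (x_id, self.square(x_sq), self.band_w(x_w), self.band_h(x_h)), dim=1
        )
```

With these assumed settings, a 64-channel input leaves 40 channels unchanged and sends 8 channels through each convolutional branch, so `InceptionDWConv2d(64)(torch.randn(2, 64, 14, 14))` returns a tensor of the same shape as its input. Because most channels skip convolution entirely, the operator reads and writes far less data through convolution kernels than a full 7x7 depthwise layer, which is the memory-access bottleneck the abstract points to.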

Paper Structure (17 sections, 7 equations, 4 figures, 10 tables, 1 algorithm)

This paper contains 17 sections, 7 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Trade-off between accuracy and training throughput. All models are trained under the DeiT training hyperparameters [deit, swin, convnext, resnetsb]. The training throughput is measured on an A100 GPU with a batch size of 128. ConvNeXt-T/k$n$ denotes variants with a depthwise convolution kernel size of $n \times n$. InceptionNeXt-T enjoys both ResNet-50's speed and ConvNeXt-T's accuracy.
  • Figure 2: Block illustration of MetaFormer, MetaNeXt, ConvNeXt and InceptionNeXt. Similar to the MetaFormer block [metaformer], MetaNeXt is a general block abstracted from ConvNeXt [convnext]. MetaNeXt can be regarded as a simpler version of MetaFormer obtained by merging its two residual sub-blocks into one. Note that the token mixer used in MetaNeXt cannot be too complex (e.g., self-attention [transformer]), or training may fail to converge. Specifying the token mixer as depthwise convolution or Inception depthwise convolution instantiates the model as a ConvNeXt or InceptionNeXt block, respectively. Compared with ConvNeXt, InceptionNeXt is more efficient because it decomposes the expensive large-kernel depthwise convolution into four efficient parallel branches.
  • Figure 3: Comparison of FLOPs between depthwise convolution and Inception depthwise convolution. Inception depthwise convolution is much more efficient than depthwise convolution as kernel size increases.
  • Figure 4: Grad-CAM [gradcam] activation maps of different models trained on ImageNet-1K. The visualized images are from the validation set of ImageNet-1K.
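
The trend in Figure 3 can be reproduced approximately with a short calculation. The sketch below counts multiply-accumulate operations (MACs) rather than the paper's exact FLOPs convention and ignores bias terms. The function names, the 3x3 square kernel, the 1/8 branch ratio, and the 96-channel 56x56 feature map are assumptions picked for illustration only.

```python
def dwconv_macs(channels, height, width, kernel):
    """MACs of a standard k x k depthwise convolution (stride 1, bias ignored)."""
    return channels * height * width * kernel * kernel


def inception_dwconv_macs(channels, height, width, band_kernel,
                          square_kernel=3, branch_ratio=0.125):
    """MACs of a four-branch Inception depthwise convolution.

    Assumes each of the three convolutional branches gets
    int(channels * branch_ratio) channels; the identity branch costs nothing.
    """
    g = int(channels * branch_ratio)
    # One square kernel plus two band kernels (1 x k and k x 1) per position.
    per_position = square_kernel ** 2 + 2 * band_kernel
    return g * height * width * per_position


# Hypothetical feature map: 96 channels at 56 x 56 resolution.
for k in (3, 7, 11, 15):
    dense = dwconv_macs(96, 56, 56, k)
    split = inception_dwconv_macs(96, 56, 56, band_kernel=k)
    print(f"k={k:2d}  depthwise={dense:>12,}  inception={split:>10,}")
```

Under these assumptions the standard depthwise cost grows with the square of the kernel size, while the decomposed cost grows only linearly with the band length and applies to a fraction of the channels, so the gap widens as the kernel gets larger.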