ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

TL;DR

ConvNeXt V2 tackles the mismatch between ConvNets and masked autoencoding by co-designing a fully convolutional masked autoencoder (FCMAE), built from a sparse-convolution ConvNeXt encoder and a lightweight decoder, with a Global Response Normalization (GRN) layer that prevents feature collapse. The approach enables effective self-supervised pretraining for pure ConvNets, achieving strong ImageNet, COCO, and ADE20K performance and setting new public-data benchmarks, including 88.9% top-1 accuracy on ImageNet-1K for the Huge model with IN-22K intermediate fine-tuning. Key contributions include the FCMAE framework, the GRN layer, and a broad model family (Atto to Huge) demonstrating scalable gains across tasks. This work shows that carefully aligned architectural design and self-supervised objectives can yield competitive, scalable performance for ConvNets, closing gaps with transformer-based methods on several benchmarks.

Abstract

Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
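
The abstract does not spell out how GRN computes its response. A minimal NumPy sketch follows, assuming a three-step form: per-channel global aggregation with an L2 norm over spatial positions, divisive normalization by the mean over channels, and a residual calibration with learnable `gamma` and `beta`. The function name `grn`, the channels-last layout, and the `eps` constant are assumptions made for illustration and are not taken from this page.

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Sketch of a Global Response Normalization step (assumed formulation).

    x:     activations with shape (N, H, W, C), channels last (assumed layout)
    gamma: per-channel scale, shape (C,)
    beta:  per-channel shift, shape (C,)
    """
    # 1) Global aggregation: one L2 norm per channel over the spatial dims.
    gx = np.sqrt(np.sum(x ** 2, axis=(1, 2), keepdims=True))   # (N, 1, 1, C)
    # 2) Divisive normalization: each channel relative to the channel mean,
    #    so strong channels are amplified and weak ones suppressed.
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)          # (N, 1, 1, C)
    # 3) Calibration with a residual connection back to the input.
    return gamma * (x * nx) + beta + x

# Tiny usage example: two channels with different global strength.
x_demo = np.ones((1, 2, 2, 2))
x_demo[..., 1] = 3.0
y_demo = grn(x_demo, gamma=np.ones(2), beta=np.zeros(2))
```

Under these assumptions, setting `gamma` and `beta` to zero makes the function return its input unchanged, so the layer can start as an identity mapping and learn how much inter-channel competition to apply.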

Paper Structure (46 sections, 3 equations, 8 figures, 16 tables, 1 algorithm)

This paper contains 46 sections, 3 equations, 8 figures, 16 tables, and 1 algorithm.

Figures (8)

  • Figure 1: ConvNeXt V2 model scaling. The ConvNeXt V2 model, which has been pre-trained using our fully convolutional masked autoencoder framework, performs significantly better than the previous version across a wide range of model sizes.
  • Figure 2: Our FCMAE framework. We introduce a fully convolutional masked autoencoder (FCMAE). It consists of a sparse convolution-based ConvNeXt encoder and a lightweight ConvNeXt block decoder. Overall, the architecture of our autoencoder is asymmetric. The encoder processes only the visible pixels, and the decoder reconstructs the image using the encoded pixels and mask tokens. The loss is calculated only on the masked region.
  • Figure 3: Feature activation visualization. We visualize the activation map for each feature channel in small squares. For clarity, we display 64 channels in each visualization. The ConvNeXt V1 model suffers from a feature collapse issue, which is characterized by the presence of redundant activations (dead or saturated neurons) across channels. To fix this problem, we introduce a new method to promote feature diversity during training: the global response normalization (GRN) layer. This technique is applied to high-dimensional features in every block, leading to the development of the ConvNeXt V2 architecture.
  • Figure 4: Feature cosine distance analysis. As the number of total layers varies for different architectures, we plot the distance values against the normalized layer indexes. We observe that the ConvNeXt V1 FCMAE pre-trained model exhibits severe feature collapse behavior. The supervised model also shows a reduction in feature diversity, but only in the final layers. This decrease in diversity in the supervised model is likely due to the use of the cross-entropy loss, which encourages the model to focus on class-discriminative features while suppressing the others.
  • Figure 5: ConvNeXt Block Designs. In ConvNeXt V2, we add the GRN layer after the dimension-expansion MLP layer and drop LayerScale (Touvron et al., 2021), as it becomes redundant.
  • ...and 3 more figures
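
Two quantities mentioned in the captions above can be sketched in a few lines of NumPy: a reconstruction loss computed only on the masked region (Figure 2) and an average pairwise cosine distance between channels as a feature-diversity measure (Figure 4). This is an illustrative sketch, not the authors' implementation. The use of mean squared error, the per-patch layout `(L, D)`, the `(1 - cos) / 2` scaling, and the function names `masked_mse` and `mean_pairwise_cosine_distance` are all assumptions.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only (assumed loss).

    pred, target: (L, D) arrays, one row of pixel values per patch
    mask:         (L,) boolean array, True where the patch was masked
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)     # (L,)
    n_masked = max(int(mask.sum()), 1)                   # guard against an empty mask
    return float((per_patch * mask).sum() / n_masked)

def mean_pairwise_cosine_distance(feats, eps=1e-12):
    """Average of (1 - cos) / 2 over all channel pairs (assumed definition).

    feats: (C, H*W) array, one flattened activation map per channel.
    Values near 0 mean the channels are nearly identical (collapsed).
    """
    # eps keeps all-zero ("dead") channels from dividing by zero.
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    cos = unit @ unit.T                                  # (C, C) cosine similarities
    return float(np.mean((1.0 - cos) / 2.0))

# Usage: visible patches do not contribute to the loss.
pred = np.zeros((3, 2))
target = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
mask = np.array([True, False, True])
loss = masked_mse(pred, target, mask)

# Usage: identical channels give a distance near zero.
collapsed = np.tile(np.arange(1.0, 5.0), (4, 1))
diversity = mean_pairwise_cosine_distance(collapsed)
```

In this sketch the middle patch is visible, so its error is ignored, and the `collapsed` example shows how redundant channels drive the distance toward zero, which is the behavior the Figure 4 caption calls feature collapse.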