ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie
TL;DR
ConvNeXt V2 addresses the mismatch between ConvNets and masked autoencoding by co-designing a fully convolutional masked autoencoder (FCMAE), which pairs a sparse ConvNeXt encoder with a lightweight decoder, and a Global Response Normalization (GRN) layer to prevent feature collapse. The approach makes self-supervised pretraining effective for pure ConvNets and yields strong results on ImageNet, COCO, and ADE20K, including 88.9% top-1 accuracy on ImageNet-1K for the Huge model with IN-22K intermediate fine-tuning, a state-of-the-art result among models trained only on public data. Key contributions are the FCMAE framework, the GRN layer, and a model family spanning Atto to Huge that demonstrates scalable gains across tasks. The work shows that aligning architectural design with the self-supervised objective lets ConvNets scale competitively, closing gaps with transformer-based methods on several benchmarks.
Abstract
Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance gains in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M-parameter Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
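The abstract names the GRN layer but does not give its formula. The following is a minimal sketch of one way such a layer can enhance inter-channel feature competition, under assumptions not stated in the text above: each channel is summarized by its L2 norm over spatial positions, that norm is divided by the mean norm across channels, and the input is rescaled by the result with a learnable scale, a learnable bias, and a residual connection. The function name `grn`, the channels-as-flat-lists layout, and the `eps` constant are illustrative choices, not the authors' code.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Sketch of a Global Response Normalization step (assumed formulation).

    x     : list of C channels, each a flat list of spatial activations
    gamma : per-channel scale (length C), hypothetical learnable parameter
    beta  : per-channel bias (length C), hypothetical learnable parameter
    """
    # 1) Global aggregation: one L2 norm per channel over all spatial positions.
    g = [math.sqrt(sum(v * v for v in channel)) for channel in x]

    # 2) Normalization: each channel's norm relative to the mean norm across
    #    channels, so channels are compared against one another.
    mean_g = sum(g) / len(g)
    n = [g_c / (mean_g + eps) for g_c in g]

    # 3) Calibration: rescale the input by its relative response, then add the
    #    bias and a residual connection back to the input.
    return [
        [gamma[c] * v * n[c] + beta[c] + v for v in channel]
        for c, channel in enumerate(x)
    ]


# Two channels with two spatial positions each: one active, one silent.
features = [[3.0, 4.0], [0.0, 0.0]]
calibrated = grn(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

In this sketch the active channel is amplified because its norm exceeds the cross-channel mean, while the silent channel stays at zero. With `gamma` and `beta` set to zero the function returns its input unchanged, so the layer can start as an identity map and learn how much competition to apply.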
