DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

Donghyun Kim, Byeongho Heo, Dongyoon Han

TL;DR

DenseNets have been historically outpaced by residual and Transformer-based architectures due to design and training limitations. This work revisits DenseNets, arguing that concatenation shortcuts can yield higher representational rank, and offers a modernized RDNet design with wider, shallower blocks, improved feature mixers, larger intermediate dimensions, and patch-based stems. Through a comprehensive pilot study of thousands of random networks and extensive ImageNet-1K/downstream task evaluations, RDNet demonstrates competitive or superior speed-accuracy trade-offs compared to state-of-the-art models, with strong performance on ADE20K and COCO as well. The findings suggest concatenation-based DenseNet designs can complement ResNet- and ViT-based paradigms, offering practical advantages in memory efficiency and robustness across resolutions. Code and models are provided to foster further exploration of DenseNet-style architectures in modern vision tasks.

Abstract

This paper revives Densely Connected Convolutional Networks (DenseNets) and reveals their underrated effectiveness over the predominant ResNet-style architectures. We believe DenseNets' potential was overlooked because their training methods were left untouched and traditional design elements did not fully reveal their capabilities. Our pilot study shows that dense connections through concatenation are strong, demonstrating that DenseNets can be revitalized to compete with modern architectures. We methodically refine suboptimal components (architectural adjustments, block redesign, and improved training recipes) to widen DenseNets and boost memory efficiency while keeping concatenation shortcuts. Our models, employing simple architectural elements, ultimately surpass Swin Transformer, ConvNeXt, and DeiT-III, key architectures in the residual learning lineage. Furthermore, our models exhibit near state-of-the-art performance on ImageNet-1K, competing with very recent models, and on downstream tasks: ADE20K semantic segmentation and COCO object detection/instance segmentation. Finally, we provide empirical analyses that uncover the merits of concatenation over additive shortcuts, steering a renewed preference towards DenseNet-style designs. Our code is available at https://github.com/naver-ai/rdnet.
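
The difference between additive and concatenation shortcuts can be illustrated with a minimal sketch on plain feature vectors. This is not the authors' implementation: `make_mixer` is a hypothetical stand-in for a feature mixer, and the input vector is arbitrary. The only details taken from the source are that each feature mixer contributes GR (growth rate) new features, that a mixing block contains three feature mixers, and that Figure 1 uses GR = 2 for illustration.

```python
from typing import Callable, List

Vector = List[float]


def additive_shortcut(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """ResNet-style shortcut: y = x + f(x); the width stays fixed."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("additive shortcuts need f(x) to match the width of x")
    return [a + b for a, b in zip(x, fx)]


def concat_shortcut(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """DenseNet-style shortcut: y = [x, f(x)]; the width grows by len(f(x))."""
    return x + f(x)  # list concatenation, not element-wise addition


def make_mixer(out_width: int) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a feature mixer: maps any input to out_width values."""
    def mixer(x: Vector) -> Vector:
        mean = sum(x) / len(x)
        return [mean * (i + 1) for i in range(out_width)]
    return mixer


x = [1.0, 2.0, 3.0, 4.0]  # arbitrary input features
growth_rate = 2           # illustrative value, matching the GR used in Figure 1

added = additive_shortcut(x, make_mixer(len(x)))

dense = x
for _ in range(3):  # three feature mixers, as in one mixing block
    dense = concat_shortcut(dense, make_mixer(growth_rate))

print(len(added))      # 4: the additive path keeps the input width
print(len(dense))      # 10: the input width 4 plus 3 mixers * GR 2
print(dense[:4] == x)  # True: earlier features pass through unchanged
```

With addition, the mixer output must match the input width and is merged into it; with concatenation, the input features are carried forward untouched and the width grows by GR per mixer, which is why the growth rate and transition layers matter for controlling width and memory.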

Paper Structure (48 sections, 10 figures, 21 tables)

Figures (10)

  • Figure 1: Schematic illustration of RDNet. RDNet's design differs from ResNet-style architectures primarily in its use of feature concatenation. We design four stages in RDNet across all scales, where each stage N comprises $L_N$ mixing blocks, each consisting of three feature mixers and one transition layer (the last mixing block does not employ the transition layer). The feature mixer $f$ is our building block, which combines previously concatenated features and compresses them into GR-dimensional features for concatenation. The growth rate (GR) controls how many features are concatenated and is predetermined for each stage. Transition layers for downsampling are positioned after each stage as before. S and C denote stride and channel size. For illustration, this figure sets GR to two.
  • Figure 2: ImageNet-1K performance trade-off among state-of-the-art models. We provide comparative visualizations against state-of-the-art models known for top performance. RDNet proves highly competitive in practice in terms of model speed and memory consumption.
  • Figure 3: ImageNet-1K performance trade-off among previous milestones. We provide comparative visualizations between previous architectures and our models, including speed comparisons to highlight actual differences in practice. Our models outperform the competing modern architectures, revealing the potential of feature concatenation in network design.
  • Figure 4: Cumulative probability vs. error of the trained models in Table \ref{tab:randnet_setups_results}, visualized following Radosavovic et al. (RegNet, CVPR 2020). Across all scales and settings, we observe that concatenation-based models outperform those employing additive shortcuts.
  • Figure 5: Accuracy/latency/memory vs. resolution. RDNet is robust to input resolution, maintaining accuracy across various input image sizes. Furthermore, RDNet exhibits a latency/memory trend similar to ConvNeXt and Swin Transformer, with only a minimal increase on larger images compared to DeiT-S and DenseNet161.
  • ...and 5 more figures
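
The "cumulative probability vs. error" curves mentioned in the Figure 4 caption can be read as an empirical distribution function over model errors: for each error threshold, the fraction of trained models whose error falls below it. The sketch below is a hedged illustration of that computation, not the authors' plotting code. The function name `error_edf`, the strict `<` comparison, and every number are assumptions chosen for illustration; the error values are made up and are not results from the paper.

```python
from typing import List, Sequence


def error_edf(errors: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Fraction of models whose error is strictly below each threshold."""
    if not errors:
        raise ValueError("need at least one model error")
    n = len(errors)
    return [sum(1 for e in errors if e < t) / n for t in thresholds]


# Made-up top-1 errors (%) for illustration only; NOT results from the paper.
concat_errors = [30.0, 31.0, 32.0, 34.0]
additive_errors = [31.0, 33.0, 34.0, 36.0]
thresholds = [31.5, 33.5, 35.0]

print(error_edf(concat_errors, thresholds))    # [0.5, 0.75, 1.0]
print(error_edf(additive_errors, thresholds))  # [0.25, 0.5, 0.75]
```

A population of models is better under this view when its curve lies higher at every threshold, meaning a larger share of its models reaches each error level.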