Designing Concise ConvNets with Columnar Stages
Ashish Kumar, Jaesik Park
TL;DR
CoSNet introduces a concise convolutional architecture built on Parallel Columnar Convolutions, Input Replication, and shallow/deep projections to achieve low depth, controlled parameter growth, and high computational density. By enforcing uniform kernel sizes and delaying fusion (Fuse-Once), CoSNet attains strong efficiency without relying on attention mechanisms. Across ImageNet and downstream tasks, CoSNet matches or surpasses many standard ConvNets and ViTs while using markedly fewer parameters and FLOPs, and with faster inference. This work highlights a practical path toward efficient CNNs that compete with transformer-based models in real-world deployment scenarios.
Abstract
In the era of vision Transformers, the recent success of VanillaNet shows the huge potential of simple and concise convolutional neural networks (ConvNets). While such models mainly focus on runtime, it is also crucial to consider other aspects, e.g., FLOPs and parameter count, to further strengthen their utility. To this end, we introduce a refreshing ConvNet macro design called Columnar Stage Network (CoSNet). CoSNet has a systematically developed, simple, and concise structure with smaller depth, low parameter count, low FLOPs, and attention-less operations, making it well suited for resource-constrained deployment. The key novelty of CoSNet lies in deploying parallel convolutions with fewer kernels fed by input replication, stacking these convolutions in columns, and minimizing the use of 1x1 convolution layers. Our comprehensive evaluations show that CoSNet rivals many renowned ConvNets and Transformer designs under resource-constrained scenarios. Code: https://github.com/ashishkumar822/CoSNet
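To give a rough sense of why parallel columns with fewer kernels can keep the parameter count low, the following back-of-the-envelope sketch compares two hypothetical stages of equal total width. It is an illustration under stated assumptions, not the CoSNet implementation: the function names, the 3x3 kernel size, and the example widths are all chosen here for illustration, the columns are assumed not to exchange information within the stage, and input replication is assumed to be a plain copy that adds no weights.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weight count of a single k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def columnar_stage_params(columns: int, width: int, depth: int, k: int = 3) -> int:
    """Weights in `columns` independent stacks, each `depth` convolutions deep.

    Assumption: every column maps `width` channels to `width` channels and
    the columns do not interact inside the stage.
    """
    return columns * depth * conv_params(width, width, k)


def dense_stage_params(columns: int, width: int, depth: int, k: int = 3) -> int:
    """Weights in one stack whose width equals the combined column width."""
    total = columns * width
    return depth * conv_params(total, total, k)


if __name__ == "__main__":
    # Hypothetical configuration: 4 columns, 64 kernels each, 3 layers deep.
    cols, width, depth = 4, 64, 3
    columnar = columnar_stage_params(cols, width, depth)
    dense = dense_stage_params(cols, width, depth)
    print(f"columnar: {columnar:,} weights")
    print(f"dense:    {dense:,} weights")
    print(f"ratio:    {dense / columnar:.1f}x")
```

Under these assumptions the dense stack needs as many times more weights as there are columns, since each dense layer scales with the square of the total width while the columnar layers scale with the square of the per-column width only. The real architecture also includes projection and fusion layers that this count leaves out.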
