KernelWarehouse: Rethinking the Design of Dynamic Convolution

Chao Li, Anbang Yao

TL;DR

KernelWarehouse rethinks dynamic convolution by introducing kernel partition, cross-layer warehouse sharing, and a contrasting-driven attention function (CAF) to enable large kernel numbers ($n$) under a fixed parameter budget. By partitioning kernels into local cells, sharing a large warehouse across layers, and using CAF to learn diverse, sometimes negative attentions, it achieves substantial accuracy gains with fewer parameters than traditional dynamic convolution methods. The approach yields state-of-the-art results on ImageNet and MS-COCO across multiple backbones, including Vision Transformers, and demonstrates favorable speed and memory characteristics. This work offers a practical path to high-capacity, parameter-efficient dynamic convolution and suggests broader applicability to modern architectures.

Abstract

Dynamic convolution learns a linear mixture of $n$ static kernels weighted with their input-dependent attentions, demonstrating superior performance to normal convolution. However, it increases the number of convolutional parameters by $n$ times, and thus is not parameter efficient. As a result, no prior work has explored the setting $n>100$ (an order of magnitude larger than the typical setting $n<10$) to push forward the performance boundary of dynamic convolution while retaining parameter efficiency. To fill this gap, we propose KernelWarehouse, a more general form of dynamic convolution, which redefines the basic concepts of "kernels", "assembling kernels" and "attention function" through the lens of exploiting convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet. We validate the effectiveness of KernelWarehouse on the ImageNet and MS-COCO datasets using various ConvNet architectures. Intriguingly, KernelWarehouse is also applicable to Vision Transformers, and it can even reduce the model size of a backbone while improving the model accuracy. For instance, KernelWarehouse ($n=4$) achieves 5.61%|3.90%|4.38% absolute top-1 accuracy gain on the ResNet18|MobileNetV2|DeiT-Tiny backbone, and KernelWarehouse ($n=1/4$) with 65.10% model size reduction still achieves 2.29% gain on the ResNet18 backbone. The code and models are available at https://github.com/OSVAI/KernelWarehouse.
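
The first two sentences of the abstract can be made concrete with a minimal sketch of conventional dynamic convolution (not KernelWarehouse itself). It assumes a softmax over $n$ attention logits, as is common in earlier dynamic convolution work; the layer sizes, the logits, and the helper names `mix_kernels` and `conv_param_count` are hypothetical, and the parameters of the attention branch are not counted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_kernels(kernels, attentions):
    """Return sum_i attentions[i] * kernels[i] for flattened kernels."""
    size = len(kernels[0])
    return [sum(a * k[p] for a, k in zip(attentions, kernels)) for p in range(size)]


def conv_param_count(c_out, c_in, k):
    """Weights in one k x k convolution kernel (bias ignored)."""
    return c_out * c_in * k * k


random.seed(0)
n = 4                      # number of static kernels
c_out, c_in, k = 8, 4, 3   # hypothetical layer sizes
size = conv_param_count(c_out, c_in, k)

# n static kernels, each stored as a flat list of weights.
kernels = [[random.gauss(0.0, 1.0) for _ in range(size)] for _ in range(n)]

# Hypothetical logits; in practice an attention branch computes them from the input.
logits = [0.5, -1.0, 2.0, 0.0]
attentions = softmax(logits)

# The kernel actually applied to this input is the attention-weighted mixture.
mixed = mix_kernels(kernels, attentions)

static_params = size       # normal convolution
dynamic_params = n * size  # dynamic convolution stores all n kernels
print(static_params, dynamic_params)
```

The mixed kernel has the same shape as a single static kernel, so the convolution itself costs the same, while the stored weights grow linearly with $n$. That growth is the parameter inefficiency the abstract refers to.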

Paper Structure (21 sections, 4 equations, 18 figures, 17 tables)

This paper contains 21 sections, 4 equations, 18 figures, 17 tables.

Figures (18)

  • Figure 1: A schematic overview of KernelWarehouse applied to a ConvNet. As a more general form of dynamic convolution, KernelWarehouse consists of three interdependent components, namely kernel partition, warehouse construction-with-sharing and contrasting-driven attention function (CAF). These redefine the basic concepts of "kernels", "assembling kernels" and "attention function" from the perspective of exploiting convolutional parameter dependencies within the same layer and across neighboring layers of a ConvNet, enabling significantly larger kernel number settings (e.g., $n>100$) while improving model accuracy and parameter efficiency. Please see the Method section for the detailed formulation.
  • Figure 2: An illustration of kernel partition and warehouse construction-with-sharing across three same-stage convolutional layers of a ConvNet. $cdd$ denotes common kernel dimension divisors, and $b$ is the desired convolutional parameter budget.
  • Figure 3: Visualization of statistical mean values of learnt attention $\alpha_{ij}$ in each warehouse. The results are obtained from the pre-trained ResNet18 model with KW ($1\times$) using the whole ImageNet validation set. Best viewed with zoom-in.
  • Figure 4: A visualization example of the attention initialization strategy for KW ($1\times$), where both $n$ and $m_{t}$ equal 6. According to our setting of $\beta_{ij}$, it helps the ConvNet build one-to-one relationships between kernel cells and linear mixtures in the early training stage. $\mathbf{e}_{z}$ is a kernel cell that does not actually exist and is kept as a constant zero matrix. At the beginning of the training process, when the temperature $\tau$ is 1, a ConvNet built with KW ($1\times$) can be roughly seen as a ConvNet with standard convolutions.
  • Figure 5: Visualization examples of attention initialization strategies for KW ($2\times$), where $n=4$ and $m_{t}=2$. (a) Our proposed strategy builds one-to-one relationships between kernel cells and linear mixtures; (b) an alternative strategy builds two-to-one relationships between kernel cells and linear mixtures.
  • ...and 13 more figures
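
The captions of Figures 2, 4 and 5 suggest how kernel partition relates the kernel number $n$ to the parameter budget $b$. The following is a rough sketch under stated assumptions, not the authors' implementation: the layer shapes are hypothetical, the helper names are invented for illustration, and the relation $n = b \cdot m_{t}$ is inferred from the captions (KW ($1\times$) with $n = m_{t} = 6$; KW ($2\times$) with $n=4$, $m_{t}=2$).

```python
from functools import reduce
from math import gcd, prod


def common_divisors(values):
    """All positive integers that divide every value (candidate cell dimensions)."""
    g = reduce(gcd, values)
    return [d for d in range(1, g + 1) if g % d == 0]


def cells_per_kernel(shape, cell_shape):
    """Number of equally sized cells a kernel of `shape` is partitioned into."""
    count = 1
    for dim, cell_dim in zip(shape, cell_shape):
        if dim % cell_dim != 0:
            raise ValueError("cell shape must divide kernel shape")
        count *= dim // cell_dim
    return count


def warehouse_plan(layer_shapes, cell_shape, budget):
    """Size a shared warehouse, assuming n = budget * m_t (inferred, see lead-in)."""
    m_t = sum(cells_per_kernel(s, cell_shape) for s in layer_shapes)
    n = max(1, round(budget * m_t))
    return {"m_t": m_t, "n": n, "warehouse_params": n * prod(cell_shape)}


# Hypothetical same-stage layers, each as (c_out, c_in, k_h, k_w).
layer_shapes = [(64, 64, 3, 3), (128, 64, 3, 3), (128, 128, 3, 3)]
static_params = sum(prod(s) for s in layer_shapes)

# Candidate cell sizes along the output-channel dimension.
out_divisors = common_divisors([s[0] for s in layer_shapes])

# Smaller cells mean more cells per layer, hence a larger n at the same budget.
coarse = warehouse_plan(layer_shapes, (64, 64, 3, 3), budget=1)
fine = warehouse_plan(layer_shapes, (16, 16, 3, 3), budget=1)
print(out_divisors, static_params, coarse, fine)
```

Under these assumptions, shrinking the cell from 64×64×3×3 to 16×16×3×3 raises $n$ from 7 to 112 while the warehouse holds exactly as many weights as the three static kernels combined, which is how a setting such as $n>100$ can coexist with a fixed budget.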