Dilated Convolution with Learnable Spacings

Ismail Khalfaoui-Hassani

TL;DR

This work introduces Dilated Convolution with Learnable Spacings (DCLS), a method that learns the positions of kernel elements within a dilated convolution. Because integer positions are not differentiable, DCLS treats positions as real-valued and uses interpolation to map each element onto the kernel grid. With bilinear and Gaussian interpolation, the method improves accuracy on image classification benchmarks and downstream tasks, and it enables large receptive fields without reducing resolution or adding many parameters. The approach extends to audio and to spiking neural networks (SNNs), where DCLS enables learnable temporal delays that yield superior performance on temporally rich datasets, including AudioSet and spiking speech benchmarks. The results suggest DCLS as a versatile reparameterization technique for large kernels, with potential for dilated-attention hybrids and neuromorphic applications, and they raise questions of efficiency, convergence, and explainability for future work.

Abstract

This thesis presents and evaluates the Dilated Convolution with Learnable Spacings (DCLS) method. Through various supervised learning experiments in the fields of computer vision, audio, and speech processing, the DCLS method proves to outperform both standard and advanced convolution techniques. The research is organized into several steps, starting with an analysis of the literature and existing convolution techniques that preceded the development of the DCLS method. We were particularly interested in the methods that are closely related to our own and that remain essential to capture the nuances and uniqueness of our approach. The cornerstone of our study is the introduction and application of the DCLS method to convolutional neural networks (CNNs), as well as to hybrid architectures that rely on both convolutional and visual attention approaches. DCLS is shown to be particularly effective in tasks such as classification, semantic segmentation, and object detection. Initially using bilinear interpolation, the study also explores other interpolation methods, finding that Gaussian interpolation slightly improves performance. The DCLS method is further applied to spiking neural networks (SNNs) to enable synaptic delay learning within a neural network that could eventually be transferred to so-called neuromorphic chips. The results show that the DCLS method stands out as a new state-of-the-art technique in SNN audio classification for certain benchmark tasks in this field. These tasks involve datasets with a high temporal component. In addition, we show that DCLS can significantly improve the accuracy of artificial neural networks for the multi-label audio classification task. We conclude with a discussion of the chosen experimental setup, its limitations, the limitations of our method, and our results.

Paper Structure (116 sections, 93 equations, 31 figures, 12 tables, 3 algorithms)

Figures (31)

  • Figure 1: Classification accuracy on ImageNet-1K as a function of latency (i.e. inverse of the throughput). Dot diameter corresponds to the number of parameters. Models represented are: CAFormer [yu2022metaformer], ConvNeXt [liu2022convnet], DilateFormer [jiao2023dilateformer], FastViT [vasu2023fastvit], InceptionNeXt [yu2023inceptionnext], InternImage [wang2022internimage], ParCNetV2 [xu2023parcnetv2], RepLKNet-31B [ding2022scaling].
  • Figure 2: Classification accuracy on ImageNet-1K as a function of latency (i.e. inverse of the throughput). Dot diameter corresponds to the number of parameters.
  • Figure 3: (a): a standard $3\times 3$ kernel. (b): a dilated $3\times 3$ kernel with dilation rate 4. (c): a 2D-DCLS kernel with 9 kernel elements and a dilated kernel size of 9. Each weight is spread over up to four adjacent pixels. (d): a 2D-DCLS kernel with 3 kernel elements and still a dilated kernel size of 9.
  • Figure 4: The distribution over epochs of kernel positions for the four stages of the ConvNeXt-T-dcls model.
  • Figure 5: Kernel positions distribution - stage 0.
  • ...and 26 more figures
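
The kernel construction described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation. The function names `dilated_kernel_size` and `build_dcls_kernel` are hypothetical, positions are assumed to be measured from the top-left corner of the grid, and only the forward construction is shown. In practice an automatic differentiation framework would supply the gradients with respect to the positions.

```python
import math


def dilated_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: (k - 1) * d + 1."""
    return (kernel_size - 1) * dilation + 1


def build_dcls_kernel(weights, positions, size):
    """Scatter each weight onto a size x size grid by bilinear interpolation.

    weights:   list of floats, one per kernel element
    positions: list of real-valued (row, col) pairs (assumed convention:
               measured from the top-left corner, clamped to [0, size - 1])
    """
    kernel = [[0.0] * size for _ in range(size)]
    for w, (r, c) in zip(weights, positions):
        # Keep the position inside the grid.
        r = min(max(r, 0.0), size - 1)
        c = min(max(c, 0.0), size - 1)
        r0, c0 = math.floor(r), math.floor(c)
        fr, fc = r - r0, c - c0  # fractional parts
        # Split the weight over the (up to) four surrounding grid cells.
        for dr, wr in ((0, 1.0 - fr), (1, fr)):
            for dc, wc in ((0, 1.0 - fc), (1, fc)):
                share = wr * wc
                if share > 0.0:
                    kernel[r0 + dr][c0 + dc] += w * share
    return kernel


# A 3x3 kernel with dilation rate 4 spans a 9x9 grid, as in Figure 3 (b).
size = dilated_kernel_size(3, 4)

# Three kernel elements at arbitrary real-valued positions, as in Figure 3 (d).
kernel = build_dcls_kernel(
    weights=[1.0, 2.0, -1.0],
    positions=[(1.25, 2.5), (4.0, 4.0), (7.75, 0.0)],
    size=size,
)
```

Each weight is split among its neighbouring cells with coefficients that sum to one, so the total weight of the kernel is preserved. A position that falls exactly on a grid cell puts the whole weight in that cell.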