Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe

TL;DR

The paper tackles limited local-context modeling in Conformer by introducing Multi-Convformer, which replaces a fixed convolution with a MultiConv module comprising multiple depthwise kernels and gating. The M-CSGU block fuses outputs from kernels of sizes $K=\{7,15,23,31\}$ via various Fusion schemes, with depth-based fusion showing strong results. Across ASR and SLU benchmarks, including Librispeech, Tedlium-2, AISHELL, and SLURP, Multi-Convformer achieves up to 8% relative WER reduction and maintains competitive parameter efficiency, outperforming or matching Conformer variants like CgMLP and E-Branchformer. The work also provides kernel- and diagonality analyses to interpret how multi-kernel convolutions affect local vs global context in encoder self-attention, highlighting practical guidelines for kernel choices and fusion strategies.

Abstract

Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. Towards this, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modelling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
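
The "up to 8% relative WER" figure is a relative reduction, not an absolute one. As a minimal sketch of the arithmetic (the function name and the example numbers are hypothetical and are not results from the paper):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both inputs are word error rates in the same unit (for example percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the formula.
# A drop from 10.0 to 9.2 is 0.8 absolute, which is about 8% relative.
example = relative_wer_reduction(10.0, 9.2)
```

The same absolute gain of 0.8 points would be a smaller relative reduction against a weaker baseline, which is why the two measures should not be mixed.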

Paper Structure (10 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Overview of our MultiConv encoder layer. It comprises a stacked architecture similar to Conformer, except the convolution block is replaced with a gated multi-kernel convolution block. We omit residual connections and layer norm for readability.
  • Figure 2: Overview of the four variants of the Fusion() operation in our proposed Multi-kernel Convolutional Spatial Gating Unit (M-CSGU) block, showing how convolution kernels with $K=\{7,15,23,31\}$ are used in our MultiConv block. Briefly, the four architectures are as follows: (a) applies element-wise addition to the outputs of the convolutions to generate the final output. (b) builds upon (a) by learning weights that determine the importance of each convolution's output. (c) uses compression kernels to reduce the input to $1/4$ of its size and then concatenates the results frame-wise to generate the output. (d) further enhances (c) by adding a depthwise convolution at the end that takes neighboring frames into account when generating the final output.
  • Figure 3: Line plot showing the degree of diagonality seen in each encoder layer's self-attention block. A higher value indicates that the attention weight matrix is more concentrated along its diagonal.
  • Figure 4: Heatmap showing the importance given to each convolution kernel across all encoder layers. A dark blue cell indicates a high level of importance given to a specific kernel.
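
The four Fusion() variants described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative pure-Python sketch, not the authors' implementation. It makes several assumptions: the weights in (b) are softmax-normalized, compression in (c) is a per-frame linear map applied to each branch output, the final kernel in (d) has size 3, and all function names are hypothetical. The gating part of the M-CSGU block is left out.

```python
import math
from typing import List

Frames = List[List[float]]  # shape: (time, channels)
KERNEL_SIZES = (7, 15, 23, 31)  # kernel sizes named in the Figure 2 caption


def depthwise_conv(x: Frames, kernel: List[List[float]]) -> Frames:
    """Depthwise 1-D convolution over time with zero 'same' padding.

    kernel[c][j] is tap j for channel c; channels are filtered independently.
    """
    T, C = len(x), len(x[0])
    k = len(kernel[0])
    half = k // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for j in range(k):
                s = t + j - half
                if 0 <= s < T:  # taps outside the sequence see zeros
                    out[t][c] += kernel[c][j] * x[s][c]
    return out


def pointwise(x: Frames, w: List[List[float]]) -> Frames:
    """Per-frame linear map; w has shape (out_channels, in_channels)."""
    return [[sum(wi * xi for wi, xi in zip(row, f)) for row in w] for f in x]


def fuse_sum(outs: List[Frames]) -> Frames:
    """(a) Element-wise addition of the branch outputs."""
    return [[sum(v) for v in zip(*frames)] for frames in zip(*outs)]


def fuse_weighted(outs: List[Frames], logits: List[float]) -> Frames:
    """(b) Weighted addition; softmax normalization is an assumption."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    weights = [e / sum(exps) for e in exps]
    return [[sum(w * v for w, v in zip(weights, vals)) for vals in zip(*frames)]
            for frames in zip(*outs)]


def fuse_depth(outs: List[Frames], compress: List[List[List[float]]]) -> Frames:
    """(c) Compress each branch to C/4 channels, then concatenate per frame."""
    small = [pointwise(o, w) for o, w in zip(outs, compress)]
    return [[v for f in frames for v in f] for frames in zip(*small)]


def fuse_depth_conv(outs, compress, kernel) -> Frames:
    """(d) Variant (c) followed by one more depthwise convolution."""
    return depthwise_conv(fuse_depth(outs, compress), kernel)


# Toy input: 6 frames, 4 channels; averaging filters stand in for learned ones.
T, C = 6, 4
x = [[float(t + c) for c in range(C)] for t in range(T)]
branch_outputs = [depthwise_conv(x, [[1.0 / k] * k for _ in range(C)])
                  for k in KERNEL_SIZES]
compress = [[[1.0 / C] * C] for _ in KERNEL_SIZES]  # each maps C -> C // 4

fused_a = fuse_sum(branch_outputs)
fused_b = fuse_weighted(branch_outputs, [0.0] * len(KERNEL_SIZES))
fused_c = fuse_depth(branch_outputs, compress)
fused_d = fuse_depth_conv(branch_outputs, compress, [[1.0 / 3] * 3 for _ in range(C)])
```

In this sketch all four variants return the same (time, channels) shape as the input, so they are interchangeable inside the block. With four branches each compressed to a quarter of the channels, the concatenation in (c) restores the original channel count.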