AudioRepInceptionNeXt: A lightweight single-stream architecture for efficient audio recognition

Kin Wai Lau; Yasar Abbas Ur Rehman; Lai-Man Po

TL;DR

AudioRepInceptionNeXt tackles the challenge of deploying audio recognition on edge devices by replacing multi-stream CNNs with a lightweight single-stream design that uses parallel multi-scale depthwise separable kernels. The model trains with multi-branch kernels to capture global and local time-frequency information, then uses a reparameterization procedure to fuse these branches into a fast single-branch kernel at inference without losing accuracy. Empirically, it achieves similar or better accuracy than state-of-the-art CNNs while reducing parameters and GFLOPs by over 50%, speeding up inference (up to $1.28\times$ on GPU), and improving mobile efficiency. The approach transfers well across diverse audio tasks, with robust performance on pretraining and downstream datasets, and is practical for mobile deployment.

Abstract

Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures for audio recognition tasks using Mel-Spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models like InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block breaks down the parallel multi-branch depth-wise convolutions with descending scales of $k \times k$ kernels into a cascade of two multi-branch depth-wise convolutions. The first multi-branch consists of parallel multi-scale $1 \times k$ depth-wise convolutional layers, followed by a similar multi-branch of parallel multi-scale $k \times 1$ depth-wise convolutional layers. This reduces the computational and memory footprint while separating the time and frequency processing of Mel-Spectrograms. The large kernels capture global frequencies and long activities, while the small kernels capture local frequencies and short activities. We also reparameterize the multi-branch design at inference time to further boost speed without losing accuracy. Experiments show that AudioRepInceptionNeXt reduces parameters and computations by more than 50% and improves inference speed by $1.28\times$ over state-of-the-art CNNs such as Slow-Fast, while maintaining comparable accuracy. It also learns robustly across a variety of audio recognition tasks. Code is available at https://github.com/StevenLauHKHK/AudioRepInceptionNeXt.
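To make the saving from the cascade concrete, the following sketch counts depth-wise weights for a set of parallel $k \times k$ branches versus a $1 \times k$ branch set followed by a $k \times 1$ branch set. The channel count and the kernel sizes (3, 11, 21) are hypothetical values chosen for illustration, not figures taken from the paper, and biases and normalization layers are ignored.

```python
def depthwise_weights(channels, kernel_sizes, separable):
    """Count depth-wise convolution weights across parallel branches.

    A depth-wise k x k kernel holds k * k weights per channel. Splitting it
    into a 1 x k kernel followed by a k x 1 kernel holds 2 * k per channel.
    """
    total = 0
    for k in kernel_sizes:
        per_channel = 2 * k if separable else k * k
        total += channels * per_channel
    return total


channels = 64                # hypothetical channel count
kernel_sizes = (3, 11, 21)   # hypothetical branch kernel sizes

full = depthwise_weights(channels, kernel_sizes, separable=False)
cascade = depthwise_weights(channels, kernel_sizes, separable=True)
print(f"k x k branches:  {full} weights")
print(f"1 x k + k x 1:   {cascade} weights")
print(f"ratio:           {cascade / full:.3f}")
```

The gap widens with kernel size, since the per-channel cost grows as $2k$ instead of $k^2$. This only accounts for the depth-wise layers; the overall reduction reported in the abstract also depends on the rest of the architecture.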

Paper Structure (29 sections, 7 equations, 4 figures, 6 tables)

Introduction
Related Work
Single-Stream Architecture to Multi-Stream Architecture
Model Re-parameterization
Methodology
Model Architecture
Model Input
Macro Design
AudioRepInceptionNeXt Block
Parallel multi-scale kernel
Depthwise Separable Kernel
Inverted Bottleneck
Identity shortcut
Reparameterization for Inference time model
Comparison to Multi-stream Slow-Fast model
...and 14 more sections

Figures (4)

Figure 1: Comparison of the Top-1 accuracy and GFLOPs on VGG Sounds. Different markers represent different baseline backbone architectures.
Figure 2: (a) Architecture of the Slow-Fast model (top) and AudioRepInceptionNeXt (bottom); (b) AudioRepInceptionNeXt block during training (left) and its structural re-parameterization after training (right); AudioRepInceptionNeXt (2D) is shown to the right of the dotted line. DW-Conv denotes depth-wise convolution; the other notation is defined in the Methodology section.
Figure 3: Re-parameterization of the horizontal multi-scale kernel in the AudioRepInceptionNeXt Block. Here we assume all the layers have the same number of input channels, output channels, and stride size.
Figure 4: Mobile runtime application development flow by using the ONNX Runtime library.
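The kernel merging that Figure 3 depicts rests on the linearity of convolution: parallel branches with the same channels and stride can be replaced by one kernel obtained by zero-padding each smaller kernel to the largest size and summing. The sketch below checks this on a single 1D channel in plain Python. The kernel sizes and helper names are hypothetical, and any normalization layers that would also need folding are left out, so this is an illustration of the principle and not the authors' implementation.

```python
import random


def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with an odd-length kernel.

    The signal is zero-padded so the output has the same length as the input.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * w for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def merge_kernels(kernels):
    """Center each odd-length kernel inside the largest one and sum them."""
    k_max = max(len(k) for k in kernels)
    merged = [0.0] * k_max
    for kernel in kernels:
        offset = (k_max - len(kernel)) // 2
        for j, w in enumerate(kernel):
            merged[offset + j] += w
    return merged


random.seed(0)
signal = [random.gauss(0, 1) for _ in range(32)]
# Hypothetical multi-scale branch kernels (sizes 3, 11, 21).
kernels = [[random.gauss(0, 1) for _ in range(k)] for k in (3, 11, 21)]

# Training-time form: run every branch and add the outputs.
branch_outputs = [conv1d_same(signal, k) for k in kernels]
multi_branch = [sum(values) for values in zip(*branch_outputs)]

# Inference-time form: one pass with the merged kernel.
merged = merge_kernels(kernels)
single_branch = conv1d_same(signal, merged)

max_diff = max(abs(a - b) for a, b in zip(multi_branch, single_branch))
print(f"merged kernel size: {len(merged)}, max difference: {max_diff:.2e}")
```

The two outputs agree up to floating-point rounding, which is why the fused model can keep the accuracy of the multi-branch model while running a single convolution per block stage.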
