Deep Space Separable Distillation for Lightweight Acoustic Scene Classification

ShuQi Ye, Yuan Tian

TL;DR

This paper tackles acoustic scene classification (ASC) with the goal of achieving high accuracy while remaining lightweight enough for practical deployment. It introduces the Deep Space Separable Distillation Network (DSSDN), built from Deep Space Separable Operators (DSSO) and Deep Space Separable Distilled Blocks (DSSDB), and incorporates a log-Mel frequency-axis cutting strategy to emphasize informative low-frequency features. Three lightweight operators, Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC), drive three DSSDN variants (Large, Middle, Small), which substantially reduce parameters and multiply-accumulate operations (MACs) while maintaining competitive accuracy on the TAU Urban Acoustic Scenes 2020 Mobile dataset. Ablation studies confirm the contributions of both the DSSDB distillation blocks and the DSSO components. Overall, the approach achieves strong ASC performance with fewer than 1M parameters and less than 1 GMAC of computation, enabling efficient deployment in real-world audio systems, with a reported performance gain of 9.8% over popular deep learning baselines.

Abstract

Acoustic scene classification (ASC) is highly important in the real world. Recently, deep learning-based methods have been widely employed for acoustic scene classification. However, these methods are currently not lightweight enough, and their performance is not satisfactory. To solve these problems, we propose a deep space separable distillation network. First, the network performs high-low frequency decomposition on the log-Mel spectrogram, significantly reducing computational complexity while maintaining model performance. Second, we design three lightweight operators specifically for ASC: Separable Convolution (SC), Orthonormal Separable Convolution (OSC), and Separable Partial Convolution (SPC). These operators exhibit highly efficient feature extraction capabilities in acoustic scene classification tasks. The experimental results demonstrate that the proposed method achieves a performance gain of 9.8% compared to currently popular deep learning methods, while also having a smaller parameter count and lower computational complexity.
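
The high-low frequency decomposition mentioned in the abstract can be pictured as a split along the mel axis of the log-Mel spectrogram. The following is only a minimal sketch under stated assumptions: the spectrogram is stored as a list of rows (one row per mel bin, ordered from low to high frequency), and the function name `split_frequency_axis` and the parameter `split_bin` are hypothetical. The actual split point, and how the network treats each band, are not given in this summary.

```python
def split_frequency_axis(log_mel, split_bin):
    """Split a log-Mel spectrogram (mel_bins x frames) into low and high bands.

    Assumption: rows are mel bins ordered from low to high frequency.
    `split_bin` is a hypothetical parameter, not a value from the paper.
    """
    if not 0 < split_bin < len(log_mel):
        raise ValueError("split_bin must fall strictly inside the mel axis")
    low_band = [row[:] for row in log_mel[:split_bin]]    # copies of the low-frequency rows
    high_band = [row[:] for row in log_mel[split_bin:]]   # copies of the high-frequency rows
    return low_band, high_band


# Toy spectrogram: 8 mel bins x 4 frames, each row filled with its bin index.
toy_log_mel = [[float(bin_idx)] * 4 for bin_idx in range(8)]
low, high = split_frequency_axis(toy_log_mel, split_bin=4)
print(len(low), len(high))
```

Processing only one band, or processing the two bands with separate and cheaper branches, shrinks the feature map that each convolution sees, which is one plausible source of the reduced computational cost the abstract describes.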

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Model performance comparison. This figure shows the accuracy, computational complexity, and model size of the three proposed lightweight networks compared with common lightweight networks.
  • Figure 2: Illustration of the proposed DSSDN framework. The backbone of the proposed DSSDN consists of five stacked DSSDB modules; channel splicing is then used to fuse the multi-scale information from the high-level and low-level layers processed by the five modules. DSSDB is built with DSSO as its basic unit, combining the characteristics of the log-Mel spectrum with those of the distillation block structure, and cutting the frequency axis.
  • Figure 3: Structure of Separable Convolution. The separable convolution block consists of two distinct channel-wise convolutions with kernels of size $3 \times 1$ and $1 \times 3$, respectively.
  • Figure 4: Structure of OSC. OSC regularizes the $1 \times 1$ convolution in the separable convolution.
  • Figure 5: Structure of SPC. SPC inputs a portion of channels in the feature map into the separable convolution layer for processing.
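
To give a rough sense of why the operators in Figures 3 to 5 are lightweight, the sketch below compares weight counts for a standard $3 \times 3$ convolution, a separable convolution, and a partial variant. It rests on several assumptions that are not confirmed by the captions: biases are ignored, each channel-wise convolution uses one filter per channel, the $1 \times 1$ convolution mixes all input channels into all output channels, and in the partial variant only the selected channels pass through the separable convolution while the rest are left untouched. The function names and the `ratio` parameter are hypothetical.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def separable_conv_params(c_in, c_out, k=3):
    """Weights of a k x 1 and a 1 x k channel-wise conv plus a 1 x 1 conv.

    Assumption: one filter per channel in each channel-wise convolution.
    """
    channelwise_vertical = c_in * k     # k x 1 kernel per channel
    channelwise_horizontal = c_in * k   # 1 x k kernel per channel
    pointwise = c_in * c_out            # 1 x 1 convolution across channels
    return channelwise_vertical + channelwise_horizontal + pointwise


def partial_separable_conv_params(channels, ratio=0.25, k=3):
    """Weights when only a fraction of the channels is processed.

    Assumption: the remaining channels pass through unchanged, and
    `ratio` is a hypothetical setting, not a value from the paper.
    """
    processed = int(channels * ratio)
    return separable_conv_params(processed, processed, k)


channels = 64
print("standard 3x3:", standard_conv_params(channels, channels))
print("separable:", separable_conv_params(channels, channels))
print("partial separable:", partial_separable_conv_params(channels))
```

Under these assumptions, most of the separable convolution's weights sit in the $1 \times 1$ convolution, which is the layer the OSC caption says is regularized.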