Data-Efficient Low-Complexity Acoustic Scene Classification via Distilling and Progressive Pruning

Bing Han, Wen Huang, Zhengyang Chen, Anbai Jiang, Pingyi Fan, Cheng Lu, Zhiqiang Lv, Jia Liu, Wei-Qiang Zhang, Yanmin Qian

TL;DR

The paper tackles data-efficient, low-complexity acoustic scene classification (ASC) with three components: Rep-Mobile, a multi-branch CNN architecture whose branches can be reparameterized into a single path at inference; ensemble knowledge distillation from transformer-heavy teachers; and progressive pruning, which compresses the model in several small stages rather than a single step so that performance is preserved. The system achieves state-of-the-art results on the TAU dataset and took first place in Task 1 of the DCASE2024 Challenge, which points to practical benefits for deployment on resource-limited devices.
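To make the idea of a reparameterizable multi-branch block concrete, here is a minimal sketch, not the authors' implementation. It assumes a single-channel 1-D signal and three hypothetical branches (a size-3 convolution, a size-1 convolution, and an identity path, each followed by BatchNorm in inference mode); the real Rep-Mobile block layout, kernel sizes, and all numeric values below are illustrative assumptions. The sketch folds each BatchNorm into its convolution and sums the branches into one kernel and one bias.

```python
import math

def conv1d_same(x, kernel, bias=0.0):
    """Zero-padded 'same' 1-D correlation with an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel)) + bias
        for i in range(len(x))
    ]

def batchnorm_eval(x, gamma, beta, mean, var, eps=1e-5):
    """BatchNorm with fixed (inference-time) statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in x]

def fuse_conv_bn(kernel, bn, eps=1e-5):
    """Fold BN(conv(x)) into an equivalent (kernel, bias) pair."""
    gamma, beta, mean, var = bn
    scale = gamma / math.sqrt(var + eps)
    return [k * scale for k in kernel], beta - mean * scale

def multi_branch(x, k3, bn3, k1, bn1, bn_id):
    """Training-time form: three parallel branches, summed."""
    b3 = batchnorm_eval(conv1d_same(x, k3), *bn3)
    b1 = batchnorm_eval(conv1d_same(x, k1), *bn1)
    b_id = batchnorm_eval(x, *bn_id)
    return [a + b + c for a, b, c in zip(b3, b1, b_id)]

def reparameterize(k3, bn3, k1, bn1, bn_id):
    """Inference-time form: one size-3 kernel and one bias."""
    fused = [
        fuse_conv_bn(k3, bn3),
        fuse_conv_bn([0.0, k1[0], 0.0], bn1),   # size-1 kernel padded to size 3
        fuse_conv_bn([0.0, 1.0, 0.0], bn_id),   # identity as a size-3 kernel
    ]
    kernel = [sum(ks) for ks in zip(*(k for k, _ in fused))]
    bias = sum(b for _, b in fused)
    return kernel, bias

# Illustrative (made-up) parameters: BN tuples are (gamma, beta, mean, var).
x = [0.5, -1.0, 2.0, 0.0, 1.5]
k3, bn3 = [0.2, -0.4, 0.1], (1.1, 0.3, 0.05, 0.8)
k1, bn1 = [0.7], (0.9, -0.2, -0.1, 1.2)
bn_id = (1.0, 0.1, 0.2, 0.5)

train_out = multi_branch(x, k3, bn3, k1, bn1, bn_id)
kernel, bias = reparameterize(k3, bn3, k1, bn1, bn_id)
infer_out = conv1d_same(x, kernel, bias)
```

The two outputs agree up to floating-point error, which is why the extra branches cost nothing at inference: only the single merged convolution is executed.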

Abstract

The goal of the acoustic scene classification (ASC) task is to classify recordings into one of a set of predefined acoustic scene classes. In real-world scenarios, however, ASC systems often face challenges such as recording device mismatch, low-complexity constraints, and limited availability of labeled data. To alleviate these issues, this paper builds a data-efficient, low-complexity ASC system with a new model architecture and improved training strategies. Specifically, we first design a new low-complexity architecture named Rep-Mobile by integrating multiple convolution branches that can be reparameterized at inference. Compared to other models, it achieves better performance at lower computational complexity. We then apply knowledge distillation and compare the data efficiency of teacher models with different architectures. Finally, we propose a progressive pruning strategy, which prunes the model multiple times in small amounts and yields better performance than single-step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves state-of-the-art (SOTA) results, and it won first place in the DCASE2024 Challenge with a significant advantage over other entries.
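The progressive pruning idea (pruning several times in small amounts rather than once) can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it uses unstructured magnitude pruning on a flat weight list, a linear sparsity schedule, and a hypothetical `finetune` callback standing in for the retraining that would happen between pruning steps. The abstract does not specify the pruning criterion or schedule.

```python
def magnitude_prune(weights, mask, sparsity):
    """Return a mask that prunes the `sparsity` fraction of smallest-magnitude weights.

    Already-pruned positions (mask == 0) sort first, so they stay pruned.
    """
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: (mask[i], abs(weights[i])))
    new_mask = [1] * len(weights)
    for i in order[:n_prune]:
        new_mask[i] = 0
    return new_mask

def progressive_prune(weights, target_sparsity, steps, finetune=None):
    """Reach `target_sparsity` in `steps` small pruning rounds.

    `finetune(weights, mask)` is a hypothetical hook that returns updated
    weights; by default the weights are left unchanged between rounds.
    """
    weights = list(weights)
    mask = [1] * len(weights)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps  # assumed linear schedule
        mask = magnitude_prune(weights, mask, sparsity)
        weights = [w * m for w, m in zip(weights, mask)]
        if finetune is not None:
            weights = finetune(weights, mask)
            # Re-apply the mask so fine-tuning cannot revive pruned weights.
            weights = [w * m for w, m in zip(weights, mask)]
    return weights, mask

# Illustrative (made-up) weights: prune 50% over 4 rounds.
example_weights = [0.9, -0.05, 0.4, -0.7, 0.1, 0.02, -0.3, 0.6]
pruned_weights, final_mask = progressive_prune(example_weights, 0.5, steps=4)
```

Setting `steps=1` gives single-step pruning with the same target sparsity, which makes the two strategies easy to compare once a real `finetune` routine is plugged in.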

Paper Structure

This paper contains 17 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The reparameterization process of the CNN block in Rep-Mobile. During training, multiple branches are used to enhance modeling ability; at inference, the branches are merged through reparameterization, so they add no computational complexity. BN denotes a BatchNorm layer.
  • Figure 2: Illustration of knowledge distillation from ensemble teachers to the low-complexity Rep-Mobile. The snowflake icon indicates that the teachers' parameters are frozen, while the flame icon indicates that the student model is updated with gradients.
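The distillation setup in Figure 2 can be illustrated with a small loss computation. This is a minimal sketch that assumes the common temperature-scaled formulation (averaged teacher probabilities, KL divergence to the student, plus cross-entropy on the hard label); the paper's exact loss, temperature, and weighting are not given here, so `temperature`, `alpha`, and all logits below are hypothetical. Teacher logits are plain constants, mirroring the frozen teachers; only the student would receive gradients in a real training loop.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(teacher_logits_list, temperature):
    """Average the softened class probabilities of all (frozen) teachers."""
    probs = [softmax(t, temperature) for t in teacher_logits_list]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits_list, label,
                      temperature=2.0, alpha=0.5):
    """alpha * cross-entropy + (1 - alpha) * T^2 * KL(teachers || student)."""
    hard = -math.log(softmax(student_logits)[label])
    soft_teacher = ensemble_soft_targets(teacher_logits_list, temperature)
    soft_student = softmax(student_logits, temperature)
    soft = kl_divergence(soft_teacher, soft_student)
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft

# Illustrative (made-up) logits for a 3-class problem with two teachers.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
student = [1.0, 0.2, 0.1]
targets = ensemble_soft_targets(teachers, temperature=2.0)
loss = distillation_loss(student, teachers, label=0)
```

Averaging probabilities rather than logits is one of several reasonable ways to combine an ensemble; it is chosen here only for simplicity.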