Creating a Good Teacher for Knowledge Distillation in Acoustic Scene Classification

Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer

TL;DR

This work investigates how different teacher attributes affect Knowledge Distillation (KD) for low-complexity Acoustic Scene Classification (ASC) under edge constraints. It systematically evaluates three teacher architectures (CP-Mobile (CPM), CP-ResNet (CPR), and PaSST), varies the teacher size, trains teachers with device-generalization methods (DIR, FMS, and their combination, DIRFMS), and experiments with teacher ensembles to train a CPM student (≈128K parameters). Key findings are that smaller CNN teachers can yield better students than larger ones, and that device-generalization methods substantially improve both teacher and student performance, especially on unseen devices; cross-architecture teacher–student pairings improve results further. The best students are obtained from ensembles of CPR and PaSST teachers trained with DIRFMS or FMS, reaching around 65.81% validation accuracy for a 128K-parameter student with 32 million MACs, which points to practical strategies for deployable ASC systems on edge devices.

Abstract

Knowledge Distillation (KD) is a widespread technique for compressing the knowledge of large models into more compact and efficient models. KD has proved to be highly effective in building well-performing low-complexity Acoustic Scene Classification (ASC) systems and was used in all the top-ranked submissions to this task of the annual DCASE challenge in the past three years. There is extensive research available on establishing the KD process, designing efficient student models, and forming well-performing teacher ensembles. However, less research has been conducted on investigating which teacher model attributes are beneficial for low-complexity students. In this work, we try to close this gap by studying the effects on the student's performance when using different teacher network architectures, varying the teacher model size, training them with different device generalization methods, and applying different ensembling strategies. The results show that teacher model sizes, device generalization methods, the ensembling strategy and the ensemble size are key factors for a well-performing student network.
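
To make the setup concrete, the following is a minimal sketch of a distillation loss with a teacher ensemble. It assumes the common formulation in which the student is trained on a weighted sum of a hard-label cross-entropy and a temperature-scaled KL divergence to the teachers' soft targets. The function names, the logit-averaging ensembling strategy, and the values of `temperature` and `kd_weight` are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature):
    """Soft targets from several teachers.

    Assumption: teachers are combined by averaging their logits; other
    ensembling strategies (e.g. averaging probabilities) are possible.
    """
    n_teachers = len(teacher_logits)
    mean_logits = [sum(col) / n_teachers for col in zip(*teacher_logits)]
    return softmax(mean_logits, temperature)


def kd_loss(student_logits, teacher_logits, label, temperature=2.0, kd_weight=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence.

    `temperature` and `kd_weight` are hypothetical hyperparameter values.
    """
    # Hard-label term: cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[label])

    # Soft-label term: KL(teacher || student) at the distillation temperature.
    teacher_soft = ensemble_soft_targets(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher_soft, student_soft) if p > 0)

    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    soft_loss = temperature ** 2 * kl
    return (1 - kd_weight) * hard_loss + kd_weight * soft_loss


if __name__ == "__main__":
    # Two hypothetical teachers and one student over three scene classes.
    teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
    student = [1.0, 0.2, -0.3]
    print(kd_loss(student, teachers, label=0))
```

In this sketch, setting `kd_weight=0.0` reduces the loss to plain cross-entropy, while `kd_weight=1.0` trains the student on the teachers' soft targets alone.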

Paper Structure

This paper contains 16 sections, 1 equation, 4 tables.