VI-PANN: Harnessing Transfer Learning and Uncertainty-Aware Variational Inference for Improved Generalization in Audio Pattern Recognition

John Fischer, Marko Orescanin, Eric Eckstrand

TL;DR

Deterministic audio transfer learning often lacks calibrated epistemic uncertainty, which limits reliability in downstream tasks. The authors propose VI-PANNs, Bayesian variants of a ResNet-54 backbone pre-trained on AudioSet that use MC Dropout and Flipout to obtain calibrated uncertainty estimates, and transfer them to ESC-50, UrbanSound8K, and DCASE2013. A new multi-label uncertainty decomposition method supports analysis of predictive uncertainty across datasets, and three Bayesian transfer-learning strategies (Flip, Det-Flip, Drop) achieve competitive performance with improved reliability. The results show that Flipout VI-PANNs achieve well-calibrated uncertainty and that uncertainty-aware transfer learning can improve generalization, including under out-of-distribution conditions such as the ShipsEar dataset.
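
To make the idea of decomposing predictive uncertainty concrete, here is a minimal sketch of the widely used entropy-based split for a single-label classifier: total predictive entropy is divided into an aleatoric part (the average entropy of individual stochastic forward passes) and an epistemic part (the remainder, which equals the mutual information between prediction and weights). This is not the paper's multi-label method, which is not described in this summary. The function names and the toy probabilities are assumptions made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(mc_probs):
    """Split predictive entropy into aleatoric and epistemic parts.

    mc_probs: list of T probability vectors, one per stochastic forward
    pass (for example, one dropout mask or one sampled weight set each).
    Hypothetical helper, not taken from the paper's code.
    """
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    # Average the class probabilities over the T passes.
    mean_probs = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    total = entropy(mean_probs)                               # predictive entropy
    aleatoric = sum(entropy(p) for p in mc_probs) / n_passes  # expected entropy
    epistemic = total - aleatoric                             # mutual information
    return total, aleatoric, epistemic


# Toy inputs (not data from the paper).
# Passes agree on an ambiguous input: the uncertainty is all aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Passes are individually confident but disagree: it is all epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

In the first toy case every pass is equally unsure, so sampling more weights would not help; in the second, each pass is certain but the passes contradict each other, which is the signature of model (epistemic) uncertainty.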

Abstract

Transfer learning (TL) is an increasingly popular approach to training deep learning (DL) models that leverages the knowledge gained by training a foundation model on diverse, large-scale datasets for use on downstream tasks where less domain- or task-specific data is available. The literature is rich with TL techniques and applications; however, the bulk of the research makes use of deterministic DL models, which are often uncalibrated and lack the ability to communicate a measure of epistemic (model) uncertainty in prediction. Unlike their deterministic counterparts, Bayesian DL (BDL) models are often well-calibrated, provide access to epistemic uncertainty for a prediction, and are capable of achieving competitive predictive performance. In this study, we propose variational inference pre-trained audio neural networks (VI-PANNs). VI-PANNs are variational inference variants of the popular ResNet-54 architecture, pre-trained on AudioSet, a large-scale audio event detection dataset. We evaluate the quality of the resulting uncertainty when transferring knowledge from VI-PANNs to other downstream acoustic classification tasks using the ESC-50, UrbanSound8K, and DCASE2013 datasets. We demonstrate, for the first time, that it is possible to transfer calibrated uncertainty information along with knowledge from upstream tasks to enhance a model's capability to perform downstream tasks.

Paper Structure (17 sections, 8 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Uncertainty calibration plots for foundation model training on AudioSet. Comparison plot of test set accuracy vs. percentage of evaluation data retained based on entropy (left), epistemic uncertainty (center), and aleatoric uncertainty (right). Shading represents a 95% CI.
  • Figure 2: Uncertainty box plots depicting results of the MC Dropout model (top row) and the Flipout model (bottom row) trained on AudioSet. The plots compare predictive entropy (left), epistemic uncertainty (middle), and aleatoric uncertainty (right) as the models are evaluated on both the AudioSet test set and the ShipsEar dataset. Both the median (orange line) and mean (dashed green line) are presented.
  • Figure 3: Uncertainty calibration plots comparing fixed-feature and fine-tuning TL techniques on UrbanSound8K. Comparison plots of test set accuracy vs. percentage of evaluation data retained based on Entropy (top), Epistemic Uncertainty (middle) and Aleatoric Uncertainty (bottom). Drop VI-PANN is on the left, Det-Flip VI-PANN in the center, and Flip VI-PANN on the right. Shading represents a 95% CI.
  • Figure 4: Uncertainty calibration plots comparing Drop, Flip, and Det-Flip VI-PANN variants on UrbanSound8K. Comparison plots of test set accuracy vs. percentage of evaluation data retained based on Entropy (left), Epistemic Uncertainty (center) and Aleatoric Uncertainty (right). Plots corresponding to fine-tuned models are on the top, fixed-feature model plots are on the bottom. Shading represents a 95% CI.
  • Figure 5: Uncertainty box plots depicting results of MC Dropout (top row), Flipout (middle row), and Det-Flip (bottom row) models fine-tuned on UrbanSound8K. The plots compare predictive entropy (left), epistemic uncertainty (middle), and aleatoric uncertainty (right) as the models are evaluated on both UrbanSound8K and the ShipsEar dataset. Both the median (orange line) and mean (dashed green line) are presented.
  • ...and 6 more figures
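
Several of the captions above describe plots of test accuracy against the percentage of evaluation data retained when samples are ranked by an uncertainty score. A minimal sketch of how such a curve could be computed follows. It assumes the most uncertain samples are discarded first; the function name, arguments, and toy values are hypothetical and are not taken from the paper.

```python
def accuracy_retention_curve(uncertainties, correct, fractions):
    """Accuracy on the most-certain fraction of samples, per fraction.

    uncertainties: one score per sample (entropy, epistemic, or aleatoric).
    correct: 1 if the prediction for that sample was right, else 0.
    fractions: retention levels in (0, 1], e.g. [0.5, 1.0].
    """
    # Rank sample indices from most certain to least certain.
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    curve = []
    for frac in fractions:
        # Keep at least one sample so the accuracy is always defined.
        keep = max(1, round(frac * len(order)))
        kept = order[:keep]
        curve.append(sum(correct[i] for i in kept) / keep)
    return curve


# Toy values only: the single wrong prediction has the highest uncertainty.
scores = [0.1, 0.9, 0.2, 0.8]
hits = [1, 0, 1, 1]
curve = accuracy_retention_curve(scores, hits, [0.5, 1.0])
```

If the uncertainty score is informative, accuracy should rise as the retained fraction shrinks, because the discarded samples are disproportionately the misclassified ones. A flat curve would suggest the score carries little information about correctness.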