How Class Ontology and Data Scale Affect Audio Transfer Learning

Manuel Milling, Andreas Triantafyllopoulos, Alexander Gebhard, Simon Rampp, Björn W. Schuller

Abstract

Transfer learning is a crucial concept within deep learning that allows artificial neural networks to benefit from a large pre-training data basis when confronted with a task of limited data. Despite its ubiquitous use and clear benefits, there are still many open questions regarding the inner workings of transfer learning and, in particular, regarding the understanding of when and how well it works. To that end, we perform a rigorous study focusing on audio-to-audio transfer learning, in which we pre-train various model states on (ontology-based) subsets of AudioSet and fine-tune them on three computer audition tasks, namely acoustic scene recognition, bird activity recognition, and speech command recognition. We report that increasing the number of samples and the number of classes in the pre-training data both have a positive impact on transfer learning. Their effect is, however, generally surpassed by the similarity between the pre-training and downstream tasks, which can lead the model to learn comparable features.
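The experimental setup summarised above follows the standard pre-train/fine-tune pattern: a network is first trained on an AudioSet subset and its feature extractor is then reused for a smaller downstream task. Below is a minimal, hedged sketch of that pattern in PyTorch; the architecture, layer names, and class counts are illustrative assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn

# Illustrative backbone: a small CNN over (e.g., log-Mel) spectrogram inputs.
# The architecture used in the paper may differ.
class AudioCNN(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)

# 1) Pre-train on an (ontology-based) AudioSet subset, e.g. with 200 classes.
pretrain_model = AudioCNN(n_classes=200)
# ... train pretrain_model on the AudioSet subset ...

# 2) Transfer: copy the feature extractor, replace the output layer to match
#    the downstream task (e.g. 10 acoustic scenes), then fine-tune.
finetune_model = AudioCNN(n_classes=10)
finetune_model.features.load_state_dict(pretrain_model.features.state_dict())
# ... fine-tune finetune_model on the downstream data ...
```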

Paper Structure

This paper contains 12 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Excerpt of the AudioSet ontology, including the specific nodes considered in the selection of pre-training data. Nodes whose branches explicitly serve as a data basis for pre-training are coloured schematically according to whether they focus on sounds related to humans (blue), nature (red), or mechanical things (green); a sketch of such branch-based selection follows this list.
  • Figure 2: Performance of fine-tuning experiments on the three datasets ASC, BAD, and SCR w. r. t. the number of pre-training samples. We consider different pre-training states trained on randomly sampled (orange) and ontology-based (blue) subsets of AS. Results are averaged across three random seeds. Note the difference in performance scale across the fine-tuning tasks.
  • Figure 3: Performance of fine-tuning experiments on the three datasets ASC, BAD, and SCR w. r. t. the number of pre-training classes. We consider different pre-training states trained on ontology-based subsets of AS. Results are averaged across three random seeds. Note the difference in performance scale across the fine-tuning tasks.
  • Figure 4: Pair-wise cosine distance between the first convolutional layers of model states, each pre-trained on a different subset of AudioSet; a sketch of this comparison follows this list.
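Ontology-based subsets such as those highlighted in Figure 1 can be assembled by traversing the publicly released AudioSet ontology file (ontology.json), in which each class entry carries an `id`, a `name`, and a list of `child_ids`. The following is an assumed reconstruction of that selection step, not the authors' exact code.

```python
import json

def collect_branch(root_name: str, ontology_path: str = "ontology.json") -> set:
    """Return the ids of root_name and all of its descendants in the AudioSet ontology."""
    with open(ontology_path) as f:
        nodes = {n["id"]: n for n in json.load(f)}
    roots = [n["id"] for n in nodes.values() if n["name"] == root_name]
    selected, stack = set(), list(roots)
    while stack:
        nid = stack.pop()
        if nid in selected:
            continue
        selected.add(nid)
        stack.extend(nodes[nid].get("child_ids", []))
    return selected

# Example: all classes under the "Human sounds" branch (blue in Figure 1).
human_ids = collect_branch("Human sounds")
```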
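The layer-wise comparison shown in Figure 4 amounts to a cosine distance between flattened weight tensors of the first convolutional layer. Below is a minimal sketch assuming PyTorch state dictionaries and a first convolution stored under the key `features.0.weight` (as in the illustrative model above); both the key and the checkpoint paths are assumptions.

```python
import torch
import torch.nn.functional as F

def conv1_cosine_distance(state_a: dict, state_b: dict,
                          key: str = "features.0.weight") -> float:
    """Cosine distance (1 - cosine similarity) between the flattened
    first-convolution weights of two model states."""
    wa = state_a[key].flatten()
    wb = state_b[key].flatten()
    return 1.0 - F.cosine_similarity(wa, wb, dim=0).item()

# Example: compare two pre-trained checkpoints (paths are placeholders).
# state_a = torch.load("pretrained_human.pt")
# state_b = torch.load("pretrained_nature.pt")
# print(conv1_cosine_distance(state_a, state_b))
```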