Pre-training with Synthetic Patterns for Audio

Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki

TL;DR

By combining MAEs and synthetic patterns, this framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio.

Abstract

In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first is the Masked Autoencoder (MAE), a self-supervised learning framework that learns by reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within data. Therefore, it matters little what the input portrays, whether it be images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element, which is synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing infringement issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while addressing the issues related to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.
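
The masking-and-reconstruction idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the 16x16 patch size, and the 75% mask ratio are assumptions chosen for illustration, and a random array stands in for a synthetic pattern or mel-spectrogram.

```python
import numpy as np

def random_mask_patches(spec, patch=16, mask_ratio=0.75, seed=0):
    """Split a 2-D array into non-overlapping patches and hide a random subset.

    All names and defaults here are hypothetical, for illustration only.
    Returns (visible_patches, mask, all_patches); mask is True where hidden.
    """
    rng = np.random.default_rng(seed)
    h, w = spec.shape
    gh, gw = h // patch, w // patch          # patch grid size
    # (gh, patch, gw, patch) -> (gh * gw, patch * patch)
    patches = (spec[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch)
               .transpose(0, 2, 1, 3)
               .reshape(gh * gw, patch * patch))
    n = gh * gw
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False                   # False = visible to the encoder
    return patches[keep_idx], mask, patches

def masked_mse(pred, target, mask):
    """Mean squared reconstruction error over the hidden patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

# Stand-in input: 128 frequency bins x 64 time frames of random values.
spec = np.random.default_rng(1).standard_normal((128, 64))
visible, mask, patches = random_mask_patches(spec)
# A real model would encode `visible` and decode a prediction for every patch;
# an all-zero prediction is used here only to exercise the loss.
loss = masked_mse(np.zeros_like(patches), patches, mask)
```

Because the loss depends only on reconstructing hidden patches from visible ones, nothing in this setup requires the input to be real audio, which is the property the framework relies on.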

Paper Structure (11 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of our proposed framework. In our framework, we first pre-train a Masked Autoencoder (MAE) using synthetic patterns, and then finetune its encoder part for downstream audio tasks. This approach eliminates the need for real data during pre-training.
  • Figure 2: Examples of the synthetic image datasets used in our work. Datasets (a-n) are proposed in baradad2021learning: (a-d) dead leaves models, (e-h) statistical image models, (i-l) StyleGAN-based models, and (m-n) feature visualization. Datasets (o-q) are large-scale synthetic datasets: (o) Shaders1k baradad2022procedural, (p) FractalDB1k kataoka2020pre, and (q) VisualAtom1k takashima2023visual.
  • Figure 3: Correlation between synthetic image properties and performance on ESC-50 fold 5. Note that we use only small-scale datasets (a-n) for calculating the correlation coefficient $r$.
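
The correlation coefficient $r$ in Figure 3 can be computed as sketched below, assuming it is the Pearson correlation (the caption does not say which coefficient is used). The function name is hypothetical and the input values are made-up placeholders, not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Placeholder values only: one image property and one accuracy per dataset.
image_property = [0.1, 0.4, 0.5, 0.9]
accuracy = [0.60, 0.70, 0.68, 0.85]
r = pearson_r(image_property, accuracy)
```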