Audio Transformers

Prateek Verma, Jonathan Berger

TL;DR

The paper tackles large-scale audio understanding by replacing convolutional front ends with a pure Transformer architecture operating on raw waveforms. It introduces a learnable front end, a multi-layer Transformer backbone, and pooling and multi-scale embedding strategies inspired by wavelets to capture time–frequency structure without convolutions. Empirical results on the FSD50K dataset show that even compact Transformer variants outperform CNN baselines, with the best performance achieved by a large Transformer employing multi-scale filters. The work demonstrates that attention-based models can learn adaptive time–frequency representations and long-range dependencies for audio, suggesting broad potential for end-to-end audio understanding and future exploration of sparse transformers and unsupervised pre-training.

Abstract

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On the standard Free Sound 50K dataset, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not rely on unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement on mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show that our models learn a non-linear, non-constant-bandwidth filter bank, which gives an adaptable time-frequency front-end representation for the task of audio understanding, different from the one learned for other tasks, e.g. pitch estimation.
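
As a rough illustration of how raw audio can reach a Transformer without any convolution, the sketch below splits a one-second waveform into 25 ms patches (the patch length given in the Figure 1 caption) and passes each patch through a single fully connected layer. The 16 kHz sampling rate, the non-overlapping patches, the ReLU activation, the embedding width of 8, and the names `patchify` and `dense_front_end` are all assumptions made for this example. They are not taken from the paper, and random weights stand in for learned filters.

```python
import random

SAMPLE_RATE = 16_000  # assumption: the sampling rate is not stated in this summary
PATCH_MS = 25         # patch length taken from the Figure 1 caption
EMBED_DIM = 8         # hypothetical embedding width, kept small for the demo


def patchify(waveform, sample_rate=SAMPLE_RATE, patch_ms=PATCH_MS):
    """Split a 1-D waveform into non-overlapping patches (no overlap is an assumption)."""
    patch_len = sample_rate * patch_ms // 1000
    n_patches = len(waveform) // patch_len
    return [waveform[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


def dense_front_end(patches, weights, bias):
    """Apply one fully connected layer with ReLU to every patch (no convolution)."""
    tokens = []
    for patch in patches:
        token = []
        for unit_weights, unit_bias in zip(weights, bias):
            activation = sum(x * w for x, w in zip(patch, unit_weights)) + unit_bias
            token.append(max(activation, 0.0))
        tokens.append(token)
    return tokens


rng = random.Random(0)
waveform = [rng.gauss(0.0, 1.0) for _ in range(SAMPLE_RATE)]  # 1 s of noise
patches = patchify(waveform)
patch_len = len(patches[0])
# One weight vector per output unit; random values stand in for learned filters.
weights = [[rng.gauss(0.0, 0.01) for _ in range(patch_len)] for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM
tokens = dense_front_end(patches, weights, bias)
print(len(patches), patch_len, len(tokens), len(tokens[0]))  # 40 400 40 8
```

Each row of `tokens` plays the role of one input token for the Transformer layers. Under these assumptions, one second of audio becomes a sequence of 40 tokens.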

Paper Structure

This paper contains 12 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An overview of the proposed Audio Transformer architecture, which uses a fully connected front-end encoder followed by Transformer layers and pooling layers. It takes 1 s of input and divides it into patches of 25 ms, then learns a front end whose output is fed to the Transformer.
  • Figure 2: The core idea of multi-scale learning with wavelets (left, from berger1994removing), and its use to create a layer that operates on intermediate Transformer embeddings at various scales. For a demo signal, we retain half of the embeddings and modify the other half using variable-sized windows.
  • Figure 3: Sorted filters learned by the front end, which form a problem-specific, non-linear, non-constant-bandwidth filter bank. This is shown by comparing them with the filters learned by the same front end for polyphonic pitch estimation, as shown in verma2016frequency.
  • Figure 4: Filters learned by the first layer of the front end show strong correlations with signal processing concepts, in particular sinusoidal signals, onset detectors, energy envelopes, and windowing functions.
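
The multi-scale layer described in the Figure 2 caption can be sketched roughly as follows. This is one possible reading of the caption, not the authors' implementation: half of the embedding dimensions pass through unchanged, and the remaining dimensions are split into groups, each replaced by its block-wise average over time with a different window size. The function name `multiscale_layer`, the default window sizes, and the use of non-overlapping averaging windows are assumptions made for illustration.

```python
def multiscale_layer(embeddings, window_sizes=(2, 4)):
    """Keep half of the embedding dimensions and average the rest over time.

    embeddings: list of T tokens, each a list of E floats.
    window_sizes: hypothetical time windows, one per group of modified dimensions.
    """
    n_tokens = len(embeddings)
    n_dims = len(embeddings[0])
    keep = n_dims // 2                                  # dimensions left untouched
    group = (n_dims - keep) // len(window_sizes)        # dimensions per window size
    out = [list(token) for token in embeddings]         # copy, input is not mutated
    for g, win in enumerate(window_sizes):
        first_dim = keep + g * group
        for d in range(first_dim, first_dim + group):
            for start in range(0, n_tokens, win):
                block = range(start, min(start + win, n_tokens))
                mean = sum(embeddings[t][d] for t in block) / len(block)
                for t in block:
                    out[t][d] = mean                    # piecewise-constant over the window
    # Any dimensions left over when the split is uneven stay unchanged.
    return out


# Four tokens with four dimensions each; token t holds the value t everywhere.
demo = [[float(t)] * 4 for t in range(4)]
for row in multiscale_layer(demo):
    print(row)
# [0.0, 0.0, 0.5, 1.5]
# [1.0, 1.0, 0.5, 1.5]
# [2.0, 2.0, 2.5, 1.5]
# [3.0, 3.0, 2.5, 1.5]
```

In the demo, the first two dimensions keep full time resolution, the third is averaged over windows of two tokens, and the fourth over all four tokens, so a single embedding carries information at several time scales.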