Content Adaptive Front End For Audio Classification

Prateek Verma, Chris Chafe

TL;DR

The paper addresses the need for adaptable time-frequency representations in audio tasks by proposing a content-adaptive, learnable front end built from a bank of convolutional filters. A sparse routing mechanism selects the most relevant filter bank for each input patch, and the entire front end is trained end-to-end alongside a transformer backbone, using a final 200-dim representation optimized with a Huber loss. Empirical analyses on FSD-50K and NSynth demonstrate interpretable learned filters and meaningful clustering of instrument families, with max-pooling and a 5-filter-bank configuration delivering strong performance and faster convergence compared to baselines. The work suggests broad applicability of content-adaptive front ends for various audio processing tasks, potentially enhancing downstream performance beyond the presented experiments.

Abstract

We propose a learnable, content-adaptive front end for audio signal processing. Before the advent of modern deep learning, we used fixed, non-learnable front ends such as the spectrogram or mel-spectrogram, with or without neural architectures. With convolutional architectures supporting various applications such as ASR and acoustic scene understanding, a shift to learnable front ends occurred, in which both the type of basis functions and the weights were learned from scratch and optimized for the particular task of interest. With the shift to transformer-based architectures with no convolutional blocks present, a linear layer projects small waveform patches onto a small latent dimension before feeding them to a transformer architecture. In this work, we propose a way of computing a content-adaptive learnable time-frequency representation. We pass each audio signal through a bank of convolutional filters, each giving a fixed-dimensional vector. It is akin to learning a bank of finite impulse response filterbanks and passing the input signal through the optimum filterbank depending on the content of the input signal. A content-adaptive learnable time-frequency representation may be more broadly applicable, beyond the experiments in this paper.
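
The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hard top-1 router, the random (untrained) weights, and every name below (`fir_maxpool`, `filterbank_features`, `route`, `adaptive_front_end`, `random_front_end`) are assumptions made for illustration, and the sizes in the usage lines are arbitrary. Each filterbank is a list of FIR kernels; a patch is filtered by every kernel of the selected bank and max-pooled over time, giving one fixed-dimensional vector per patch.

```python
import math
import random


def fir_maxpool(patch, kernel):
    """Slide one FIR kernel over the patch (valid mode, no kernel flip,
    as in deep-learning conv layers) and max-pool the outputs over time."""
    k = len(kernel)
    return max(
        sum(kernel[j] * patch[t + j] for j in range(k))
        for t in range(len(patch) - k + 1)
    )


def filterbank_features(patch, bank):
    """One pooled value per kernel: a fixed-dimensional vector for the patch."""
    return [fir_maxpool(patch, kernel) for kernel in bank]


def route(patch, router_weights):
    """Hypothetical hard router: linear score per bank, pick the arg-max."""
    scores = [sum(w * x for w, x in zip(row, patch)) for row in router_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def adaptive_front_end(patch, banks, router_weights):
    """Send the patch through the single bank chosen from its content."""
    idx = route(patch, router_weights)
    return idx, filterbank_features(patch, banks[idx])


def random_front_end(n_banks, n_filters, kernel_len, patch_len, seed=0):
    """Random stand-ins for weights that would normally be learned."""
    rng = random.Random(seed)
    banks = [
        [[rng.gauss(0.0, 1.0) for _ in range(kernel_len)] for _ in range(n_filters)]
        for _ in range(n_banks)
    ]
    router = [[rng.gauss(0.0, 1.0) for _ in range(patch_len)] for _ in range(n_banks)]
    return banks, router


# Usage: a 64-sample sinusoidal patch, 5 banks of 8 kernels each (arbitrary sizes).
banks, router = random_front_end(n_banks=5, n_filters=8, kernel_len=16, patch_len=64)
patch = [math.sin(0.3 * t) for t in range(64)]
bank_index, features = adaptive_front_end(patch, banks, router)
```

In a trained system the kernels and router would be optimized jointly with the downstream model; a hard arg-max is used here only to keep the sketch short, and it is not differentiable as written.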

Paper Structure (11 sections, 1 equation, 4 figures, 1 table)

This paper contains 11 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Our proposed method of computing the front end compared to a mixture of experts model proposed by Jacobs et al. (1991). We learn a bank of convolutional filters that can be thought of as a set of finite impulse response filterbanks.
  • Figure 2: Conv filters from the 1st/2nd filterbank when $N_f = 2$
  • Figure 3: Distance matrix of mixture weights of a particular sound (filter-index routed to as a 5-dim vector)
  • Figure 4: Top-5 accuracy (1 s patches) for our bank of filterbanks (BF) front end compared to mixture of experts (ME). Better results are obtained for BF using max-pooling (Pink) compared to the same $N_f = 5$ using avg-pooling (Brown). BF avg-pooling for $N_f = 2$ (Purple), $N_f = 10$ (Red). ME with $N_f = 2, 10, 5$ (respectively, Orange, Blue, Green). The performance with Audio Transformer [5] corresponding to $N_f = 1$ is slightly below ME with $N_f = 1$ (not shown).