BAT: Better Audio Transformer Guided by Convex Gated Probing

Houtan Ghaffari, Lukas Rauch, Christoph Scholz, Paul Devos

TL;DR

We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models, which rely on legacy implementations of prior SSL methods.

Abstract

Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.

Paper Structure (16 sections, 8 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Convex Gated Probing (CGP). We illustrate the probing process of a spectrogram embedding for a ViT-Base backbone. CGP applies a learnable soft-gating vector (softmax) to compute a weighted sum of embeddings from all layers ($L$). The gating aggregates the hierarchy into a single representation, which is then compared against $K$ prototypes. The cosine similarities of the patch embeddings are min-max pooled and concatenated with the ones from the cls-token, resulting in $3K$ features for a linear classifier.
  • Figure 2: CGP Ablation. Increasing the number of prototypes consistently improves results, but with diminishing returns. The best computation-performance trade-off is dataset-dependent.
  • Figure 3: Impact of audio frontend. A recording containing the labels [Whimper, Gasp, Speech, Outside, urban or manmade]. (a) Our incorporated audio frontend: Mel-spectrogram with decibel compression and local min-max normalization, exhibiting clear spectral structure and high contrast. (b) Audio-MAE, EAT, and SSLAM: simple log, filtering, Mel-spectrogram, and global standardization. Note the artifacts and blurring, particularly at lower frequencies.
  • Figure 4: Layer-wise latent information. We display the layer-wise latent information quality across three models on AS-20k: (a) BAT with the lightweight CNN (best performer from Table 3), (b) EAT (baseline), and (c) our final BAT (ViT decoder). The top row displays the linear probing performance of each Transformer block. The bottom row visualizes the learned gating weights from CGP. Notably, the standard EAT (b) and the CNN-based BAT (a) exhibit a middle-heavy distribution where semantic information peaks early. In contrast, the heavy ViT decoder in the final BAT (c) shifts the semantic peak toward the later layers, improving linear separability at the output.
  • Figure 5: Impact of gating on attention. With gating, attention is distributed more evenly and each token attends more to itself, rather than sinking into a single token, primarily the cls-token.
  • ...and 1 more figure
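
The CGP forward pass described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `l2_normalize`, `cgp_features`), the placement of the cls-token at index 0, the ordering of the concatenated features, and the toy dimensions are all hypothetical, and the random arrays stand in for frozen backbone outputs and learned parameters. The softmax gate produces non-negative weights that sum to one, so the fused representation is a convex combination of the layer embeddings, which is one plausible reading of "convex" in the method's name.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cgp_features(layer_embeddings, gate_logits, prototypes):
    """Map frozen per-layer embeddings to 3K probe features.

    layer_embeddings: (L, 1 + N, D); assumed layout: cls-token first, then N patches
    gate_logits:      (L,) learnable gating logits
    prototypes:       (K, D) learnable prototype vectors
    """
    gate = softmax(gate_logits)                              # (L,), sums to 1
    fused = np.tensordot(gate, layer_embeddings, axes=1)     # (1 + N, D) weighted sum over layers
    sims = l2_normalize(fused) @ l2_normalize(prototypes).T  # (1 + N, K) cosine similarities
    cls_sims, patch_sims = sims[0], sims[1:]
    # Max- and min-pool the patch similarities over the token axis and
    # concatenate with the cls-token similarities: K + K + K = 3K features.
    return np.concatenate([cls_sims, patch_sims.max(axis=0), patch_sims.min(axis=0)])


# Toy usage with hypothetical sizes (L=12 matches the depth of a ViT-Base backbone).
rng = np.random.default_rng(0)
L, N, D, K, C = 12, 8, 16, 4, 5
embeddings = rng.normal(size=(L, 1 + N, D))   # stand-in for frozen layer outputs
features = cgp_features(embeddings, np.zeros(L), rng.normal(size=(K, D)))
logits = features @ rng.normal(size=(3 * K, C))  # linear classifier on the 3K features
print(features.shape, logits.shape)  # (12,) (5,)
```

In an actual probe, only the gating logits, the prototypes, and the linear classifier would be trained while the backbone stays frozen. After training, the softmax of the gating logits can be read as a per-layer importance profile, which is the quantity the bottom row of Figure 4 visualizes.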