Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz

TL;DR

This work identifies pooling as the key bottleneck in probing frozen audio SSL embeddings for multi-label tasks. It introduces binarized prototypical probes that perform per-class, multi-vector aggregation over the token map, significantly outperforming traditional linear and attentive probes across a large, diverse benchmark. The results show that probing with per-class prototypes provides a faithful, efficient assessment closer to fine-tuning performance, challenging the default reliance on costly fine-tuning in AudioSet. The approach offers substantial memory efficiency and robustness, with implications for evaluation practices in audio SSL and beyond.

Abstract

Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning when pursuing state-of-the-art on AudioSet. A key reason is that global pooling creates an information bottleneck, causing linear probes to misrepresent the embedding quality: the $\texttt{cls}$-token discards crucial token information about dispersed, localized events in audio. This weakness is rooted in the mismatch between the pretraining objective (global) and the downstream task (localized). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we investigate the global pooling bottleneck. We introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.

Paper Structure

This paper contains 19 sections, 8 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: The pooling bottleneck. Visualizing embeddings from a purely self-supervised model (EAT) and its supervised +-adapted version (EAT +) for a spectrogram from urban. (a) A PCA of the token map shows that EAT embeddings are rich but entangled, a result of the masked prediction objective, while EAT + embeddings are localized and aligned with input events. (b) The [cls]-token's attention starts similarly for both models, but is diffuse for EAT in later layers, while EAT + becomes spatially selective, highlighting its limitation as a probe vector. (c) Our protobin disentangles these correlated EAT embeddings to recover localized event information. (d) For the EAT + model, protobin further enhances the embeddings, providing a superior representation to the [cls]-token.
  • Figure 2: Probing on as20k with EAT.
  • Figure 3: Binarized prototypical pooling (schematic). Example shown for a base audio SSL backbone with $D{=}768$-dim tokens and a $64{\times}8$ token map. There are $J$ learnable prototypes, which are binarized on-the-fly. Tokens are matched against these prototypes, max pooling aggregates spatial evidence, and a final linear layer maps the resulting prototype scores to class logits.
  • Figure 4: Weights and similarities example. Trained protobin on urban.
  • Figure 5: Pairwise win matrices for pooling methods. Each cell shows the number of configurations where a method outperforms another (ties omitted, one sd above opponent), aggregated over all datasets and base (non-supervised +) backbones. Extracted from \ref{tab:baseproberesults} and \ref{tab:appbioacoustic} (Appendix \ref{appsub:fewshotbirdset}).
  • ...and 3 more figures
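
The pooling pipeline described in the Figure 3 caption (binarize the prototypes, match tokens against them, max-pool over the token map, apply a linear layer) can be sketched as a forward pass. This is a minimal, dependency-free illustration and not the authors' implementation. Several details are assumptions the caption does not state: binarization is taken to be the elementwise sign, matching is taken to be a dot product, and the names `binarize`, `dot`, and `protobin_forward` are hypothetical. Training (including how gradients pass through the binarization) is not shown.

```python
import random


def binarize(vec):
    """Assumed binarization: elementwise sign, with zero mapped to +1.0."""
    return [1.0 if v >= 0.0 else -1.0 for v in vec]


def dot(a, b):
    """Dot product of two equal-length vectors (assumed similarity measure)."""
    return sum(x * y for x, y in zip(a, b))


def protobin_forward(tokens, prototypes, weight, bias):
    """Forward pass of a binarized prototypical probe (illustrative sketch).

    tokens:     N x D token map from the frozen encoder, flattened over space
    prototypes: J x D real-valued prototypes, binarized on the fly
    weight:     C x J linear layer mapping prototype scores to class logits
    bias:       C bias terms
    Returns a list of C class logits.
    """
    bin_protos = [binarize(p) for p in prototypes]
    # Max pooling: each prototype keeps its best-matching token's similarity.
    scores = [max(dot(t, p) for t in tokens) for p in bin_protos]
    # Linear layer over the J prototype scores.
    return [dot(w, scores) + b for w, b in zip(weight, bias)]


# Toy sizes: a 64 x 8 token map as in the caption, but a small hypothetical
# D, J, and C so the pure-Python loop stays fast.
rng = random.Random(0)
N, D, J, C = 64 * 8, 16, 6, 3
tokens = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
prototypes = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(J)]
weight = [[rng.gauss(0.0, 1.0) for _ in range(J)] for _ in range(C)]
bias = [0.0] * C

logits = protobin_forward(tokens, prototypes, weight, bias)
print(len(logits))  # one logit per class
```

Because the maximum is taken per prototype rather than over a single pooled vector, each prototype can respond to a different region of the token map, which is the per-class, multi-vector aggregation the TL;DR refers to. In practice an array library would replace the nested loops.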