Streaming Audio Transformers for Online Audio Tagging

Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

TL;DR

The paper tackles online audio tagging with transformer models by introducing Streaming Audio Transformers (SAT), which combine Vision Transformer backbones with Transformer-XL-like chunking to support streaming inference with low delay and memory use. The authors pretrain with masked autoencoders, finetune on full-context Audioset, and use pseudo strong labels to train SAT models under short delays (2s and 1s). They demonstrate that SAT-T/SAT-S/SAT-B achieve competitive or superior mAP at short delays while substantially reducing memory and computation compared with offline SOTA methods such as AST, BEATs, and HTS-AT, and they show improved segment-level tagging and long-duration event detection. The results suggest that SAT makes real-world streaming audio tagging practical on commodity hardware, with faster responses and robust detection of long-duration events.

Abstract

Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at a delay of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at https://github.com/RicherMans/SAT.
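
The chunk processing mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stream_chunks`, the parameters `chunk_len` and `mem_len`, and the use of plain lists are all assumptions made for illustration. The idea shown is only the general Transformer-XL-style pattern, in which each new chunk is processed together with a short cache carried over from earlier chunks, so the model never needs the full recording in memory.

```python
from typing import Iterator, List, Sequence, Tuple


def stream_chunks(
    frames: Sequence[float], chunk_len: int, mem_len: int
) -> Iterator[Tuple[List[float], List[float]]]:
    """Yield (memory, chunk) pairs over a frame sequence.

    Hypothetical helper: `memory` holds the last `mem_len` items seen
    before the current chunk. In a real Transformer-XL-style model the
    cache would hold hidden states rather than raw input frames.
    """
    if chunk_len <= 0 or mem_len < 0:
        raise ValueError("chunk_len must be > 0 and mem_len must be >= 0")
    memory: List[float] = []
    for start in range(0, len(frames), chunk_len):
        chunk = list(frames[start:start + chunk_len])
        yield memory, chunk
        # Build a fresh cache from the most recent items; older context is dropped.
        memory = (memory + chunk)[-mem_len:] if mem_len > 0 else []


# Toy usage: 10 "frames", chunks of 4, cache of 2.
for memory, chunk in stream_chunks(list(range(10)), chunk_len=4, mem_len=2):
    print(memory, chunk)
```

In this sketch the delay is governed by `chunk_len` (a prediction can be emitted once a chunk is complete), while `mem_len` bounds how much past context is kept, which is what keeps memory use constant for arbitrarily long inputs.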

Paper Structure

This paper contains 14 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The proposed training pipeline consists of three stages. First, the model is pretrained using masked auto-encoders (MAE); second, it is trained with standard full-context training (10s clips); third, our best model (ViT-B) is used to predict labels on a fine scale for SAT training.
  • Figure 2: Comparison of output probability scores between the baselines and the proposed SAT-T for a 10-minute-long "sound of water" event. Samples (Top: mg4kDY_hy6o, Bottom: jkLRith2wcc) were taken from YouTube and evaluated using 2s (left) and 10s (right) chunks. Best viewed in color.