Streaming Audio Transformers for Online Audio Tagging
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang
TL;DR
The paper tackles online audio tagging with transformer models by introducing Streaming Audio Transformers (SAT), which combine Vision Transformer (ViT) backbones with Transformer-XL-like chunk processing to support streaming inference with low delay and memory usage. The authors pretrain with masked autoencoders, finetune on full-context Audioset, and then use pseudo strong labels to train SAT models for short delays (2s and 1s). They report that SAT-T, SAT-S, and SAT-B achieve competitive or superior mAP at these short delays while substantially reducing memory and computation compared with offline SOTA methods such as AST, BEATs, and HTS-AT, and they show improved segment-level tagging and long-duration event detection. The results suggest that SAT makes streaming audio tagging practical on commodity hardware, with faster responses and robust detection of long-duration events.
Abstract
Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at delays of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at https://github.com/RicherMans/SAT.
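To make the idea of Transformer-XL-like chunk processing more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single attention layer without learned projections, and the names `attend`, `stream_chunks`, `chunk_len`, and `mem_len` are hypothetical. Each incoming chunk attends to itself plus a small cache of frames kept from earlier chunks, so the attention context stays bounded regardless of how long the audio stream is.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context):
    # Single-head scaled dot-product attention with no learned projections
    # (an illustrative simplification; real models use Q/K/V projections).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context


def stream_chunks(frames, chunk_len, mem_len):
    # Process `frames` (shape: time x dim) one chunk at a time.
    # Each chunk attends to [cached memory; current chunk], so the context
    # never exceeds mem_len + chunk_len frames.
    dim = frames.shape[1]
    memory = np.zeros((0, dim))
    outputs = []
    for start in range(0, len(frames), chunk_len):
        chunk = frames[start:start + chunk_len]
        context = np.concatenate([memory, chunk], axis=0)
        outputs.append(attend(chunk, context))
        # Keep only the most recent mem_len frames for the next chunk.
        memory = context[-mem_len:] if mem_len > 0 else np.zeros((0, dim))
    return np.concatenate(outputs, axis=0)
```

In this sketch, the delay is governed by `chunk_len` (an output is available once a chunk has arrived), while `mem_len` controls how much past context is carried forward. In a multi-layer model the cache would typically hold hidden states per layer rather than raw input frames.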
