BEAST: Online Joint Beat and Downbeat Tracking Based on Streaming Transformer

Chih-Cheng Chang, Li Su

TL;DR

BEAST tackles online beat and downbeat tracking with a streaming Transformer that combines contextual block processing and relative positional encoding to achieve low-latency inference. The method processes partially available audio frames through block-based attention while preserving long-range dependencies via inherited context and a relative timing bias. Empirical results show that BEAST outperforms prior online models, attaining approximately 80% beat F1 and 46.8% downbeat F1 at a maximum latency under 50 ms, with robust performance across multiple datasets. This work demonstrates the potential of streaming Transformers in MIR and suggests broad applicability to real-time transcription and accompaniment tasks.

Abstract

Many deep learning models have achieved dominant performance on the offline beat tracking task. However, online beat tracking, in which only the past and present input features are available, remains challenging. In this paper, we propose BEAt tracking Streaming Transformer (BEAST), an online joint beat and downbeat tracking system based on the streaming Transformer. To deal with online scenarios, BEAST applies contextual block processing in the Transformer encoder. Moreover, we adopt relative positional encoding in the attention layer of the streaming Transformer encoder to capture relative timing position, which is critically important information in music. In beat and downbeat experiments on benchmark datasets for a low-latency scenario with maximum latency under 50 ms, BEAST achieves an F1-measure of 80.04% for beat and 46.78% for downbeat, a substantial improvement of about 5 percentage points over the state-of-the-art online beat tracking model.
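
The two mechanisms named in the abstract, block-wise processing with an inherited context and a relative positional bias in attention, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`block_attention`, `stream_blocks`), the single attention head, the absence of learned query/key/value projections, the zero initial context, and the lack of any look-ahead frames are all simplifying assumptions made here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def block_attention(block, context, rel_bias):
    """Self-attention over [context token; block frames] with a relative bias.

    block:    (T, D) frames of the current block
    context:  (D,)   vector inherited from the previous block (assumption)
    rel_bias: (2K+1,) one bias per relative offset in [-K, K], with K >= T
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    tokens = np.vstack([context[None, :], block])   # (T+1, D)
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)          # content term
    idx = np.arange(n)
    offsets = idx[None, :] - idx[:, None]            # key index minus query index
    center = (len(rel_bias) - 1) // 2
    scores = scores + rel_bias[offsets + center]     # relative timing term
    out = softmax(scores) @ tokens
    return out[1:], out[0]                           # frame outputs, new context


def stream_blocks(frames, block_size, rel_bias):
    """Process frames block by block, passing a context vector forward."""
    context = np.zeros(frames.shape[1])              # assumed initial context
    outputs = []
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        block_out, context = block_attention(block, context, rel_bias)
        outputs.append(block_out)
    return np.vstack(outputs)


rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))                    # 10 frames, 4 features
bias = np.zeros(9)                                   # offsets -4..4 for blocks of 4
out = stream_blocks(frames, block_size=4, rel_bias=bias)
```

Because each block sees only its own frames and the context carried over from earlier blocks, changing frames in a later block leaves the outputs of earlier blocks unchanged, which is the property an online tracker needs.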

Paper Structure

This paper contains 12 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The contextual block processing encoder.
  • Figure 2: The complete model architecture of BEAST.