Efficient Vocal Source Separation Through Windowed Sink Attention

Christodoulos Benetatos, Yongyi Zang, Randal Leistikow

TL;DR

The paper tackles the high compute cost of full temporal attention in vocal separation by showing that temporal attention is largely local. It introduces Windowed Sink Attention (WSA), which combines a small local window with global sink tokens, and implements it via flex_attention for efficiency. Through knowledge distillation from a Mel-Band-Roformer (MB-R) teacher, the WSA student recovers about $92\%$ of the original SDR while achieving a $44.5\times$ reduction in attention FLOPs on 8-second inputs, a favorable efficiency-accuracy trade-off. The results suggest that exploiting modality-specific structure (local temporal dependencies plus sparse global context) enables scalable, edge-friendly vocal separation models. The code is released under the MIT license.

Abstract

State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs heavy computational costs that scale quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
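
The windowed sink pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (which, per the TL;DR, uses flex_attention); it is plain Python that only shows which query/key frame pairs such a mask would keep. The function names, the placement of sinks at the first frames, the symmetric (non-causal) window, and the example sizes are all assumptions made for illustration.

```python
def wsa_allowed(q_idx: int, kv_idx: int, window: int, num_sinks: int) -> bool:
    """Return True if query frame q_idx may attend to key frame kv_idx."""
    # Assumption: sink tokens are the first `num_sinks` frames, visible to every query.
    is_sink = kv_idx < num_sinks
    # Assumption: a symmetric local window of about `window` frames around the query.
    is_local = abs(q_idx - kv_idx) <= window // 2
    return is_sink or is_local


def build_mask(seq_len: int, window: int, num_sinks: int) -> list[list[bool]]:
    """Dense boolean mask of shape (seq_len, seq_len); True means 'attend'."""
    return [
        [wsa_allowed(q, k, window, num_sinks) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def dense_to_sparse_ratio(seq_len: int, window: int, num_sinks: int) -> float:
    """Ratio of full-attention pairs (seq_len ** 2) to pairs kept by the mask."""
    mask = build_mask(seq_len, window, num_sinks)
    kept = sum(sum(row) for row in mask)
    return (seq_len * seq_len) / kept


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show how the ratio grows with length.
    for seq_len in (64, 128, 256):
        ratio = dense_to_sparse_ratio(seq_len, window=8, num_sinks=2)
        print(f"seq_len={seq_len}: full/kept pair ratio = {ratio:.2f}")
```

Because the number of kept pairs grows roughly linearly with sequence length while full attention grows quadratically, the ratio increases with longer inputs. An index-based predicate of this shape is similar in spirit to the `mask_mod` callable that PyTorch's flex_attention takes. The ratio here counts attended pairs only and is not expected to reproduce the reported 44.5x figure, which depends on the paper's actual window size, sink count, and sequence length.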

Paper Structure

This paper contains 13 sections, 3 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: The framework of both band-split and mel-band roformer architectures. From wang2023mel.
  • Figure 2: Attention patterns in Mel-Band-Roformer transformer layers. Top row: temporal attention shows highly localized patterns concentrated near the diagonal across all layers. We highlight local structure (top right) through 30×30 zoom-in windows. Bottom row: frequency attention shows more distributed patterns, especially in the lower-mid mel bands. BS-R patterns are similar and omitted for space.
  • Figure 3: Objective testing results of original (blue) and modified (red) variants of MB-R models. Best viewed in color.