Efficient Vocal Source Separation Through Windowed Sink Attention
Christodoulos Benetatos, Yongyi Zang, Randal Leistikow
TL;DR
The paper tackles the high compute cost of full temporal attention in vocal separation by showing that temporal attention in a pre-trained model is largely local. It introduces Windowed Sink Attention (WSA), which combines a small local window with global sink tokens, and implements it via flex_attention for efficiency. Through knowledge distillation from a Mel-Band-Roformer (MB-R) teacher, the WSA student recovers about $92\%$ of the original SDR while achieving a $44.5\times$ reduction in attention FLOPs on 8-second inputs, demonstrating a favorable efficiency-accuracy trade-off. The results suggest that exploiting modality-specific structure (local temporal dependencies with sparse global context) enables scalable, edge-friendly vocal separation models. Code is released under the MIT license.
Abstract
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we found that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window together with attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
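To make the windowed sink pattern concrete, the following is a minimal sketch of the attention mask, not the released implementation. It assumes that the sinks are the first `num_sinks` frames of the sequence and that the local window is symmetric (non-causal); neither detail is stated in the abstract. The names `wsa_allowed`, `build_mask`, `attended_fraction`, `window`, and `num_sinks` are hypothetical and chosen for illustration only.

```python
# Illustrative sketch only: sink placement and window symmetry are assumptions.

def wsa_allowed(q_idx: int, kv_idx: int, window: int, num_sinks: int) -> bool:
    """Return True if query frame q_idx may attend to key frame kv_idx."""
    is_sink = kv_idx < num_sinks                # assumed: sinks are the first frames
    is_local = abs(q_idx - kv_idx) <= window    # assumed: symmetric local band
    return is_sink or is_local


def build_mask(seq_len: int, window: int, num_sinks: int) -> list[list[bool]]:
    """Dense boolean mask of shape (seq_len, seq_len); True means 'attend'."""
    return [
        [wsa_allowed(q, k, window, num_sinks) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def attended_fraction(mask: list[list[bool]]) -> float:
    """Fraction of query-key pairs kept, relative to full attention."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    # Toy sizes, unrelated to the configuration evaluated in the paper.
    mask = build_mask(seq_len=8, window=1, num_sinks=1)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    print(attended_fraction(mask))  # 28 of 64 pairs kept -> 0.4375
```

For a fixed `window` and `num_sinks`, each query keeps at most `2 * window + 1 + num_sinks` keys, so the number of attended pairs grows linearly with sequence length instead of quadratically. The toy fraction above only illustrates this trend and is not meant to reproduce the reported 44.5x figure. The predicate also has the same shape as a `mask_mod(b, h, q_idx, kv_idx)` function for PyTorch's flex_attention; a tensor version would use `|` rather than `or`.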
