SELEBI: Percussion-aware Time Stretching via Selective Magnitude Spectrogram Compression by Nonstationary Gabor Transform
Natsuki Akaishi, Nicki Holighaus, Kohei Yatabe
TL;DR
This work tackles percussion smearing in phase vocoder–based time stretching by addressing a fundamental phase–magnitude mismatch. It introduces SELEBI, a signal-adaptive method that uses the Nonstationary Gabor Transform to selectively compress the magnitude spectrogram around percussive events, yielding temporally localized transients while preserving energy and enabling stable reconstruction. By combining onset-based compression rates with adaptive window lengths and hop sizes, SELEBI achieves sharpened transients and preserved tonal content, outperforming classical PV methods and other percussion-aware approaches, particularly at high stretch factors. The results demonstrate substantial improvements in both objective spectral accuracy and subjective perceptual quality, with potential for online bounded-delay deployment in real-time audio processing.
Abstract
Phase vocoder-based time stretching is a widely used technique for time-scale modification of audio signals. However, conventional implementations suffer from "percussion smearing," a well-known artifact that significantly degrades the quality of percussive components. We attribute this artifact to a fundamental time-scale mismatch between the temporally smeared magnitude spectrogram and the localized, newly generated phase. To address this, we propose SELEBI, a signal-adaptive phase vocoder algorithm that substantially reduces percussion smearing while preserving stability and the perfect reconstruction property. Unlike conventional methods that rely on heuristic processing or component separation, our approach leverages the nonstationary Gabor transform. By dynamically adapting the analysis window lengths so that short windows cover intervals containing significant energy from percussive components, we compute a temporally localized magnitude spectrogram directly from the time-domain signal. This makes the temporal structures of the magnitude and the phase more consistent with each other. Furthermore, the perfect reconstruction property of the nonstationary Gabor transform guarantees stable, high-fidelity signal synthesis, in contrast to previous heuristic approaches. Experimental results demonstrate that the proposed method effectively mitigates percussion smearing and yields natural sound quality.
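To make the idea of signal-adaptive window lengths concrete, the following minimal sketch assigns a short analysis window to frames whose energy jumps sharply and a long window elsewhere. It is an illustration under stated assumptions, not the SELEBI algorithm: the energy-ratio onset detector, the function name `choose_window_lengths`, and the parameters `frame`, `short_win`, `long_win`, and `ratio` are all hypothetical, and the sketch does not construct the nonstationary Gabor frame or its dual windows.

```python
import math


def choose_window_lengths(x, frame=256, short_win=256, long_win=2048, ratio=4.0):
    """Pick one window length per frame (hypothetical helper, not from the paper).

    A frame is flagged as an onset when its energy exceeds `ratio` times the
    energy of the previous frame; flagged frames get `short_win`, all other
    frames get `long_win`.
    """
    n_frames = len(x) // frame
    # Per-frame energy; the small constant avoids comparisons against exact zero.
    energy = [
        sum(s * s for s in x[i * frame:(i + 1) * frame]) + 1e-12
        for i in range(n_frames)
    ]
    lengths, onsets = [], []
    for i, e in enumerate(energy):
        prev = energy[i - 1] if i > 0 else e  # the first frame is never an onset
        is_onset = e > ratio * prev
        onsets.append(is_onset)
        lengths.append(short_win if is_onset else long_win)
    return lengths, onsets


# Synthetic test signal (assumed for illustration): a quiet 440 Hz tone
# with a short rectangular click added at samples 8000..8063.
sr = 16000
signal = [0.1 * math.sin(2 * math.pi * 440.0 * n / sr) for n in range(sr)]
for n in range(8000, 8064):
    signal[n] += 1.0

win_lengths, onsets = choose_window_lengths(signal)
```

On this synthetic input, only the frame containing the click is assigned the short window, while the tonal frames keep the long window. A real system would need a more robust onset measure and window sequences that satisfy the conditions for perfect reconstruction, which this sketch does not check.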
