SELEBI: Percussion-aware Time Stretching via Selective Magnitude Spectrogram Compression by Nonstationary Gabor Transform

Natsuki Akaishi, Nicki Holighaus, Kohei Yatabe

TL;DR

This work tackles percussion smearing in phase vocoder (PV)-based time stretching by addressing a fundamental phase–magnitude mismatch. It introduces SELEBI, a signal-adaptive method that uses the nonstationary Gabor transform to selectively compress the magnitude spectrogram around percussive events, yielding temporally localized transients while preserving energy and enabling stable reconstruction. By combining onset-based compression rates with adaptive window lengths and hop sizes, SELEBI sharpens transients while preserving tonal content, and it outperforms classical PV methods and other percussion-aware approaches, particularly at high stretch factors. The results demonstrate substantial improvements in both objective spectral accuracy and subjective perceptual quality, with potential for online, bounded-delay deployment in real-time audio processing.

Abstract

Phase vocoder-based time-stretching is a widely used technique for the time-scale modification of audio signals. However, conventional implementations suffer from "percussion smearing," a well-known artifact that significantly degrades the quality of percussive components. We attribute this artifact to a fundamental time-scale mismatch between the temporally smeared magnitude spectrogram and the localized, newly generated phase. To address this, we propose SELEBI, a signal-adaptive phase vocoder algorithm that significantly reduces percussion smearing while preserving stability and the perfect reconstruction property. Unlike conventional methods that rely on heuristic processing or component separation, our approach leverages the nonstationary Gabor transform. By dynamically adapting analysis window lengths to assign short windows to intervals containing significant energy associated with percussive components, we directly compute a temporally localized magnitude spectrogram from the time-domain signal. This approach ensures greater consistency between the temporal structures of the magnitude and phase. Furthermore, the perfect reconstruction property of the nonstationary Gabor transform guarantees stable, high-fidelity signal synthesis, in contrast to previous heuristic approaches. Experimental results demonstrate that the proposed method effectively mitigates percussion smearing and yields natural sound quality.
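The smeared magnitude spectrogram mentioned in the abstract can be illustrated by how long a single analysis window "sees" a transient. The following is a minimal sketch, not the authors' implementation: it uses an ordinary fixed-window STFT in plain NumPy rather than the nonstationary Gabor transform, and the function names, window lengths, hop size, and threshold are all illustrative assumptions. It measures how many samples a single click occupies in the magnitude spectrogram for a long and a short Hann window.

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude STFT with a Hann window; returns an array of shape (freq, time)."""
    w = np.hanning(win_len)
    # Zero-pad both ends so that events near the edges are fully covered.
    xp = np.concatenate([np.zeros(win_len), x, np.zeros(win_len)])
    n_frames = 1 + (len(xp) - win_len) // hop
    frames = np.stack([xp[i * hop : i * hop + win_len] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def temporal_spread(mag, hop, thresh=0.01):
    """Span (in samples) of the frames whose energy exceeds thresh * max energy."""
    energy = (mag ** 2).sum(axis=0)
    active = np.flatnonzero(energy > thresh * energy.max())
    return (active[-1] - active[0] + 1) * hop

# A single click as an idealized percussive event (assumed test signal).
x = np.zeros(8192)
x[4000] = 1.0

spread_long = temporal_spread(stft_mag(x, win_len=2048, hop=16), hop=16)
spread_short = temporal_spread(stft_mag(x, win_len=128, hop=16), hop=16)
print("long window spread (samples): ", spread_long)
print("short window spread (samples):", spread_short)
```

The long window spreads the click over a much longer interval than the short one. This is the kind of temporal smearing that assigning short windows to percussive intervals is meant to avoid, while long windows remain available for tonal content elsewhere.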

Paper Structure (26 sections, 13 equations, 14 figures, 3 tables, 3 algorithms)

Figures (14)

  • Figure 1: Block diagrams of the basic method, PV with identity phase locking [laroche2002improved] (left), and the proposed method (right). By leveraging NSDGT, the proposed method synthesizes the target signal from T-F representations in which both the magnitude and phase spectrograms of percussive components are highly concentrated in the time direction.
  • Figure 2: Comparison of windows and spectrograms for the DGT and NSDGT. The upper boxes illustrate the window shift in the time domain, where representative DGT windows are highlighted in black for clarity. The bottom boxes display the corresponding spectrograms for (a) DGT with a long window, (b) DGT with a short window, and (c) the NSDGT.
  • Figure 3: Conceptual illustration of time-directional spectrogram "squeezing." (a) The conventional PV-based method using DGT. (b) The proposed method utilizing transient concentration. The top row displays the input time-domain signal (amplitude vs. time) and the analysis window functions. The second and third rows schematically represent the magnitude spectrogram and the corresponding generated phase, respectively. The bottom row shows the synthesized time-stretched signal. In these schematic representations, the percussive component is colored red, and the windows capturing this component are emphasized (non-percussive components are omitted in this row for clarity). In the spectrograms, the red area highlights the percussive component, while the blue and green areas represent the magnitude and phase of the other components, respectively. Because the percussive interval is maintained at its original time-scale, it appears compressed relative to the new, stretched time axis (illustrated below the spectrograms). The bottom panel details the synthesis of the percussive component across the new time frames. The markers $\star 1$ and $\star 2$ highlight the key innovations of the proposed method: shortening the window length and reducing the number of time frames, respectively.
  • Figure 4: The flow of the proposed method. The left column details the NSDGT parameter calculation, while the right column illustrates the subsequent processing steps.
  • Figure 5: Example of the computation of the compression rate. From top left to bottom right, (a) the magnitude spectrogram $|\mathbf{X}|$, (b) the enhancement mask, (c) the enhanced spectrogram $|\mathbf{X}_{\text{p}}|$, (d) the frequency-directional sum of $|\mathbf{X}|$, (e) the frequency-directional sum of $|\mathbf{X}_{\text{p}}|$, and (f) the compression rate $\mathbf{r}$ (the detected peaks are plotted in yellow). The mask in (b) is colored in white where $\mathcal{M}(\mathbf{X},\mathbf{\Phi}_{\textrm{mix}})[m,n] = 1$.
  • ...and 9 more figures
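
The captions for Figures 3 and 5 describe detecting percussive peaks from frequency-directional sums of the spectrogram and using shorter windows around those events. As a rough sketch of that general idea only, the code below sums a magnitude spectrogram along frequency, picks peaks of a half-wave-rectified frame-to-frame rise, and assigns a short window length near each peak. The enhancement mask and the actual definition of the compression rate $\mathbf{r}$ are not given in this excerpt, so the rectified difference, the threshold, and every name and parameter here (`detect_peaks`, `assign_window_lengths`, `long_len`, `short_len`, `radius`) are assumptions standing in for them.

```python
import numpy as np

def frequency_sum(mag):
    """Sum a (freq, time) magnitude spectrogram along the frequency axis."""
    return mag.sum(axis=0)

def detect_peaks(novelty, rel_thresh=0.3):
    """Indices of local maxima above rel_thresh * max (a simple stand-in detector)."""
    if novelty.size < 3 or novelty.max() <= 0:
        return np.array([], dtype=int)
    mid = novelty[1:-1]
    is_peak = (
        (mid > novelty[:-2])
        & (mid >= novelty[2:])
        & (mid > rel_thresh * novelty.max())
    )
    return np.flatnonzero(is_peak) + 1

def assign_window_lengths(mag, long_len=2048, short_len=256, radius=2):
    """Per-frame window length: short within `radius` frames of a peak, long elsewhere."""
    s = frequency_sum(mag)
    # Half-wave-rectified rise in summed magnitude (assumed novelty measure).
    novelty = np.maximum(np.diff(s, prepend=s[0]), 0.0)
    peaks = detect_peaks(novelty)
    lengths = np.full(mag.shape[1], long_len)
    for p in peaks:
        lengths[max(0, p - radius) : p + radius + 1] = short_len
    return lengths, peaks

# Synthetic spectrogram: a steady tonal floor plus two broadband bursts.
mag = 0.1 * np.ones((64, 100))
mag[:, 30] += 1.0
mag[:, 31] += 0.5   # decaying tail of the first burst
mag[:, 70] += 1.0

lengths, peaks = assign_window_lengths(mag)
print("detected peak frames:", peaks)
print("frames with short windows:", np.flatnonzero(lengths == 256))
```

In this toy setup the two bursts are the only frames flagged, and only their immediate neighborhoods receive the short window. A real adaptive scheme would additionally have to choose hop sizes so that the resulting window system still admits perfect reconstruction, which this sketch does not address.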