STSM-FiLM: A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech

Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Fo-Rui Li, Yan-Tsung Peng, Hsin-Min Wang, Yu Tsao

TL;DR

This work tackles time-scale modification of speech by addressing artifacts from traditional methods and limited neural conditioning. It introduces STSM-FiLM, a fully neural FiLM-conditioned architecture that uses a continuous speed factor $\alpha$ and WSOLA-generated supervision to learn alignment and synthesis, applicable across four encoder–decoder backends. Experiments show that FiLM conditioning improves generalization across speed factors, with STFT-based decoders delivering higher fidelity and WavLM-based decoders delivering stronger naturalness and ASR compatibility; subjective listening tests also favor WavLM-HiFiGAN. Overall, FiLM-based conditioning offers robust, continuous control for neural TSM with promising practical implications for perceptual quality and intelligibility.

Abstract

Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods like Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FiLM, a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network using WSOLA-generated outputs, STSM-FiLM learns to mimic alignment and synthesis behaviors while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and demonstrate that STSM-FiLM is capable of producing perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.

Paper Structure

This paper contains 8 sections, 3 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: System architecture of the proposed STSM-FiLM model. The input waveform is encoded into latent feature representations by a feature encoder. The TSM Feature Modulator (FiLM) adapts these features using the speed factor $\alpha$ and pre-trained weights. The decoder/generator then reconstructs the time-scaled waveform from the modulated features.
  • Figure 2: Trends of average objective metrics across speed factors (0.5–2.0) for five systems.
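
The modulation step described in the Figure 1 caption can be illustrated with a minimal sketch of generic FiLM conditioning: a conditioning value (here the speed factor $\alpha$) is mapped to a per-channel scale $\gamma$ and shift $\beta$, which are then applied to every frame of the encoder features. This is not the authors' implementation. The function names `make_film_generator` and `film`, the single linear map from $\alpha$ to $(\gamma, \beta)$, and the random weights are all assumptions made for illustration; the actual modulator is presumably a learned network whose details are not given in this summary.

```python
import random


def make_film_generator(num_channels, seed=0):
    """Build a hypothetical map from a scalar speed factor to (gamma, beta).

    Assumption: a single linear layer with random weights stands in for
    whatever learned conditioning network the real model uses.
    """
    rng = random.Random(seed)
    w_gamma = [rng.uniform(-0.1, 0.1) for _ in range(num_channels)]
    w_beta = [rng.uniform(-0.1, 0.1) for _ in range(num_channels)]

    def generate(alpha):
        # gamma starts near 1 so the modulation is close to identity.
        gamma = [1.0 + w * alpha for w in w_gamma]
        beta = [w * alpha for w in w_beta]
        return gamma, beta

    return generate


def film(features, gamma, beta):
    """Apply feature-wise linear modulation: y[t][c] = gamma[c] * x[t][c] + beta[c].

    features: list of frames, each a list of per-channel floats.
    gamma, beta: per-channel scale and shift, shared across all frames.
    """
    return [
        [g * x + b for x, g, b in zip(frame, gamma, beta)]
        for frame in features
    ]


# Usage: modulate three 4-channel frames for a speed factor of 1.5.
generate = make_film_generator(num_channels=4)
gamma, beta = generate(1.5)
frames = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
modulated = film(frames, gamma, beta)
```

Because $\gamma$ and $\beta$ depend only on $\alpha$ and not on the frame index, the same modulation is shared across time, and $\alpha$ can take any continuous value rather than a fixed set of rates.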