STSM-FiLM: A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech
Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Fo-Rui Li, Yan-Tsung Peng, Hsin-Min Wang, Yu Tsao
TL;DR
This work tackles time-scale modification of speech by addressing artifacts from traditional methods and limited neural conditioning. It introduces STSM-FiLM, a fully neural FiLM-conditioned architecture that uses a continuous speed factor $\alpha$ and WSOLA-generated supervision to learn alignment and synthesis, applicable across four encoder–decoder backends. Experiments show that FiLM conditioning improves generalization across speed factors, with STFT-based decoders delivering higher fidelity and WavLM-based decoders delivering stronger naturalness and ASR compatibility; subjective listening tests also favor WavLM-HiFiGAN. Overall, FiLM-based conditioning offers robust, continuous control for neural TSM with promising practical implications for perceptual quality and intelligibility.
Abstract
Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods like Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FiLM, a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network using WSOLA-generated outputs, STSM-FiLM learns to mimic alignment and synthesis behaviors while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and show that STSM-FiLM produces perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.
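
To make the conditioning mechanism concrete, the following is a minimal sketch of a generic FiLM operation driven by a scalar speed factor. It is not the STSM-FiLM implementation: the function names `make_film_params` and `film`, the use of a single linear map from the speed factor to the per-channel scale and shift, and the random initialisation are all assumptions made for illustration. A real model would typically learn these maps (often with a small MLP) and apply them inside the encoder-decoder network.

```python
import random


def make_film_params(num_channels, seed=0):
    """Create hypothetical linear maps from the speed factor to (gamma, beta).

    Assumption: gamma_c(alpha) = w_gamma[c] * alpha + b_gamma[c], and likewise
    for beta. The biases are chosen so that alpha = 0 gives the identity.
    """
    rng = random.Random(seed)
    return {
        "w_gamma": [rng.gauss(0.0, 0.1) for _ in range(num_channels)],
        "b_gamma": [1.0] * num_channels,  # scale starts at 1
        "w_beta": [rng.gauss(0.0, 0.1) for _ in range(num_channels)],
        "b_beta": [0.0] * num_channels,   # shift starts at 0
    }


def film(features, alpha, params):
    """Apply feature-wise linear modulation to a sequence of frames.

    features: list of frames, each a list of num_channels floats.
    alpha:    scalar speed factor used as the conditioning input.
    Returns a new list with out[t][c] = gamma_c * features[t][c] + beta_c.
    """
    gamma = [w * alpha + b for w, b in zip(params["w_gamma"], params["b_gamma"])]
    beta = [w * alpha + b for w, b in zip(params["w_beta"], params["b_beta"])]
    # The same per-channel scale and shift are applied to every time frame.
    return [[g * x + s for g, x, s in zip(gamma, frame, beta)] for frame in features]


if __name__ == "__main__":
    params = make_film_params(num_channels=3)
    frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print(film(frames, alpha=1.5, params=params))
```

Because the scale and shift are functions of a continuous input, the same set of parameters can in principle be queried at any speed factor, which is the property the abstract refers to as continuous control.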
