STSM-FiLM: A FiLM-Conditioned Neural Architecture for Time-Scale Modification of Speech

Dyah A. M. G. Wisnu; Ryandhimas E. Zezario; Stefano Rini; Fo-Rui Li; Yan-Tsung Peng; Hsin-Min Wang; Yu Tsao

TL;DR

This work tackles time-scale modification of speech by addressing artifacts from traditional methods and limited neural conditioning. It introduces STSM-FiLM, a fully neural FiLM-conditioned architecture that uses a continuous speed factor $\alpha$ and WSOLA-generated supervision to learn alignment and synthesis, applicable across four encoder–decoder backends. Experiments show that FiLM conditioning improves generalization across speed factors, with STFT-based decoders delivering higher fidelity and WavLM-based decoders delivering stronger naturalness and ASR compatibility; subjective listening tests also favor WavLM-HiFiGAN. Overall, FiLM-based conditioning offers robust, continuous control for neural TSM with promising practical implications for perceptual quality and intelligibility.

Abstract

Time-Scale Modification (TSM) of speech aims to alter the playback rate of audio without changing its pitch. While classical methods like Waveform Similarity-based Overlap-Add (WSOLA) provide strong baselines, they often introduce artifacts under non-stationary or extreme stretching conditions. We propose STSM-FiLM, a fully neural architecture that incorporates Feature-Wise Linear Modulation (FiLM) to condition the model on a continuous speed factor. By supervising the network using WSOLA-generated outputs, STSM-FiLM learns to mimic alignment and synthesis behaviors while benefiting from representations learned through deep learning. We explore four encoder-decoder variants: STFT-HiFiGAN, WavLM-HiFiGAN, Whisper-HiFiGAN, and EnCodec, and demonstrate that STSM-FiLM is capable of producing perceptually consistent outputs across a wide range of time-scaling factors. Overall, our results demonstrate the potential of FiLM-based conditioning to improve the generalization and flexibility of neural TSM models.
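To make the conditioning idea concrete, the following is a minimal sketch of feature-wise linear modulation driven by a scalar speed factor. It is an illustration under stated assumptions, not the implementation used in STSM-FiLM: the class name `FiLM`, the per-channel affine maps from the speed factor to the scale and shift, and the random initialization are all hypothetical, and a trained model would more likely learn these maps with a small network inside a deep learning framework. The sketch only shows the core operation, in which each feature channel is scaled by `gamma(alpha)` and shifted by `beta(alpha)`.

```python
import random


class FiLM:
    """Feature-wise linear modulation conditioned on a scalar speed factor alpha.

    Illustrative only: gamma and beta are hypothetical per-channel affine
    functions of alpha, standing in for a learned conditioning network.
    """

    def __init__(self, num_channels, seed=0):
        rng = random.Random(seed)
        # Assumed parameterization: gamma = w_gamma * alpha + b_gamma, and
        # beta = w_beta * alpha + b_beta, with one pair per channel.
        self.w_gamma = [rng.uniform(-0.1, 0.1) for _ in range(num_channels)]
        self.b_gamma = [1.0] * num_channels  # identity scaling at alpha = 0
        self.w_beta = [rng.uniform(-0.1, 0.1) for _ in range(num_channels)]
        self.b_beta = [0.0] * num_channels   # zero shift at alpha = 0

    def params(self, alpha):
        """Return the per-channel (gamma, beta) for a given speed factor."""
        gamma = [w * alpha + b for w, b in zip(self.w_gamma, self.b_gamma)]
        beta = [w * alpha + b for w, b in zip(self.w_beta, self.b_beta)]
        return gamma, beta

    def __call__(self, features, alpha):
        """Modulate features, given as a list of frames of num_channels floats."""
        gamma, beta = self.params(alpha)
        return [
            [g * x + b for g, x, b in zip(gamma, frame, beta)]
            for frame in features
        ]


# Two frames of 4-channel encoder features (made-up values).
film = FiLM(num_channels=4)
frames = [[0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
slow = film(frames, alpha=0.5)
fast = film(frames, alpha=2.0)
```

Because the speed factor enters as a continuous input rather than a discrete label, the same modulation layer can in principle be queried at any value of the factor, which is the property the abstract attributes to FiLM-based conditioning.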
