SynthCloner: Synthesizer-style Audio Transfer via Factorized Codec with ADSR Envelope Control
Jeng-Yue Liu, Ting-Chao Hsu, Yen-Tung Yeh, Li Su, Yi-Hsuan Yang
TL;DR
SynthCloner is a factorized codec that disentangles attack-decay-sustain-release (ADSR) envelopes, timbre, and content to enable controllable synthesizer-style audio transfer. It is trained on SynthCAT, a 3M-sample dataset spanning 250 timbres, 120 ADSR envelopes, and 100 MIDI files, and uses perturbation-based disentanglement and auxiliary tasks to transfer both spectral and temporal characteristics faithfully. In the reported experiments, the model outperforms baselines on objective metrics (MSTFT, LRMSD, F0RMSE) and on subjective MOS, and the explicit ADSR path proves crucial for quality. The approach supports expressive sound design and efficient exploration of timbre/envelope combinations in synthetic audio domains.
Abstract
Electronic synthesizer sounds are controlled by parameter settings that yield complex timbral characteristics and ADSR envelopes, making synthesizer-style audio transfer particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive audio transfer with independent control over these attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics, while enabling independent attribute control. The code, model checkpoint, and audio examples are available at https://buffett0323.github.io/synthcloner/.
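To make the envelope attribute concrete, the following is a minimal sketch of a piecewise-linear ADSR envelope applied to a fixed tone. It is an illustration only, not the SynthCAT rendering pipeline or the SynthCloner model: the function names, the sample rate, the linear segment shapes, and the two example presets are all assumptions. The final lines also assume that the dataset is the full grid of the stated timbre, envelope, and MIDI counts, which is consistent with the reported size of about 3M samples.

```python
import math

def ramp(start, stop, n):
    """Return n linearly spaced values from start toward stop (stop excluded)."""
    return [start + (stop - start) * i / n for i in range(n)]

def adsr_envelope(attack, decay, sustain, release, note_len, sr=16000):
    """Piecewise-linear ADSR amplitude envelope (hypothetical helper).

    attack, decay, release, note_len are in seconds; sustain is a level in [0, 1].
    Simplification: if note_len is shorter than attack + decay, the attack and
    decay segments are still rendered in full before the release begins.
    """
    n_a = round(attack * sr)
    n_d = round(decay * sr)
    n_r = round(release * sr)
    n_s = max(round(note_len * sr) - n_a - n_d, 0)  # samples held at sustain level
    return (
        ramp(0.0, 1.0, n_a)        # attack: rise from silence to peak
        + ramp(1.0, sustain, n_d)  # decay: fall from peak to sustain level
        + [sustain] * n_s          # sustain: hold while the note is on
        + ramp(sustain, 0.0, n_r)  # release: fade out after note-off
    )

def apply_envelope(env, freq=440.0, sr=16000):
    """Shape a fixed sine tone (same content and timbre) with an envelope."""
    return [math.sin(2 * math.pi * freq * i / sr) * e for i, e in enumerate(env)]

sr = 16000
# Two illustrative presets: only the envelope differs, the note stays the same.
pluck_env = adsr_envelope(0.01, 0.1, 0.7, 0.2, note_len=0.5, sr=sr)
pad_env = adsr_envelope(0.3, 0.1, 0.9, 0.5, note_len=0.5, sr=sr)
pluck = apply_envelope(pluck_env, sr=sr)
pad = apply_envelope(pad_env, sr=sr)

# Assumed full grid of timbres x ADSR envelopes x MIDI sequences.
n_combinations = 250 * 120 * 100
```

The two rendered tones share pitch and waveform and differ only in how their amplitude evolves over time, which is the kind of variation an independent ADSR control is meant to capture.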
