SynthCloner: Synthesizer-style Audio Transfer via Factorized Codec with ADSR Envelope Control

Jeng-Yue Liu, Ting-Chao Hsu, Yen-Tung Yeh, Li Su, Yi-Hsuan Yang

TL;DR

SynthCloner presents a factorized codec that disentangles ADSR envelopes, timbre, and content to enable controllable synthesizer-style audio transfer. Built atop SynthCAT, a 3M-sample dataset spanning 250 timbres, 120 ADSR envelopes, and 100 MIDI files, the model uses perturbation-based disentanglement and auxiliary tasks to achieve faithful transfer of both spectral and temporal characteristics. Empirical results show state-of-the-art performance on objective metrics (MSTFT, LRMSD, F0RMSE) and subjective MOS scores, with an explicit ADSR path proving crucial for quality. The approach offers practical capabilities for expressive sound design and efficient exploration of timbre/envelope combinations in synthetic audio domains.

Abstract

Electronic synthesizer sounds are controlled by parameter settings that yield complex timbral characteristics and ADSR envelopes, making synthesizer-style audio transfer particularly challenging. Recent approaches to timbre transfer often rely on spectral objectives or implicit style matching, offering limited control over envelope shaping. Moreover, public synthesizer datasets rarely provide diverse coverage of timbres and ADSR envelopes. To address these gaps, we present SynthCloner, a factorized codec model that disentangles audio into three attributes: ADSR envelope, timbre, and content. This separation enables expressive audio transfer with independent control over these attributes. Additionally, we introduce SynthCAT, a new synthesizer dataset with a task-specific rendering pipeline covering 250 timbres, 120 ADSR envelopes, and 100 MIDI sequences. Experiments show that SynthCloner outperforms baselines on both objective and subjective metrics, while enabling independent attribute control. The code, model checkpoint, and audio examples are available at https://buffett0323.github.io/synthcloner/.

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Data rendering pipeline of SynthCAT. A sustained-note segment is cropped, mapped to the MIDI sequence through pitch shifting and duration alignment, and then shaped with the ADSR envelope on a per-note basis to produce the final audio.
  • Figure 2: SynthCloner model architecture. To perform synthesizer-style audio transfer (SAT), replace $\mathbf{x}_\text{e}$ and $\mathbf{x}_\text{t}$ with the reference audio.
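
The per-note envelope shaping described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' rendering code. It assumes a piecewise-linear ADSR curve, uses a plain sine tone as a stand-in for the cropped and pitch-shifted sustained-note segment, and places the release inside the note's duration. The names `adsr_envelope` and `render`, and all parameter values, are hypothetical.

```python
import numpy as np

def adsr_envelope(n_samples, sr, attack, decay, sustain, release):
    """Piecewise-linear ADSR gain curve with exactly n_samples values in [0, 1].

    attack, decay, release are durations in seconds; sustain is a gain level.
    Assumption: segments are clipped so they always fit inside the note.
    """
    a = min(int(round(attack * sr)), n_samples)
    r = min(int(round(release * sr)), n_samples - a)
    d = min(int(round(decay * sr)), n_samples - a - r)
    s = n_samples - a - d - r  # whatever is left is the sustain plateau
    return np.concatenate([
        np.linspace(0.0, 1.0, a, endpoint=False),      # attack: 0 -> 1
        np.linspace(1.0, sustain, d, endpoint=False),  # decay: 1 -> sustain
        np.full(s, sustain),                           # sustain plateau
        np.linspace(sustain, 0.0, r),                  # release: sustain -> 0
    ])

def render(notes, sr=16000, attack=0.01, decay=0.1, sustain=0.6, release=0.1):
    """Shape each note with the same ADSR envelope and sum into one buffer.

    notes: list of (onset_seconds, duration_seconds, frequency_hz).
    """
    spans = [(int(round(on * sr)), int(round(dur * sr)), f) for on, dur, f in notes]
    out = np.zeros(max(start + n for start, n, _ in spans))
    for start, n, freq in spans:
        t = np.arange(n) / sr
        tone = np.sin(2 * np.pi * freq * t)  # stand-in for the sustained segment
        out[start:start + n] += tone * adsr_envelope(n, sr, attack, decay, sustain, release)
    return out

# Two consecutive half-second notes sharing one envelope setting.
env = adsr_envelope(8000, 16000, attack=0.01, decay=0.1, sustain=0.6, release=0.1)
audio = render([(0.0, 0.5, 440.0), (0.5, 0.5, 660.0)])
```

Because the envelope is applied per note rather than over the whole clip, changing only the four ADSR values re-shapes every note onset and tail while leaving pitch and timbre untouched, which is the kind of independent variation the dataset is built to cover.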