INSTRUMENTAL: Automatic Synthesizer Parameter Recovery from Audio via Evolutionary Optimization

Philipp Bogdan

Abstract

Existing audio-to-MIDI tools extract notes but discard the timbral characteristics that define an instrument's identity. We present Instrumental, a system that recovers continuous synthesizer parameters from audio by coupling a differentiable 28-parameter subtractive synthesizer with CMA-ES, a derivative-free evolutionary optimizer. We optimize a composite perceptual loss combining mel-scaled STFT, spectral centroid, and MFCC divergence, achieving a matching loss of 2.09 on real recorded audio. We systematically evaluate eight hypotheses for improving convergence and find that only parametric EQ boosting yields meaningful improvement. Our results show that CMA-ES outperforms gradient descent on this non-convex landscape, that more parameters do not monotonically improve matching, and that spectral analysis initialization accelerates convergence over random starts.
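
The search loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: `toy_synth` is a hypothetical 2-parameter stand-in for the 28-parameter subtractive synthesizer, `composite_loss` replaces the mel-scaled STFT, spectral centroid, and MFCC terms with a log-harmonic distance plus a centroid term, and `evolve` is a plain elitist evolution strategy with isotropic Gaussian sampling rather than CMA-ES (it adapts no covariance matrix). It shows only the general idea of recovering parameters from a target spectrum without gradients.

```python
import math
import random

N_HARMONICS = 8  # number of harmonics in the toy spectrum (assumption)


def toy_synth(params):
    """Hypothetical 2-parameter synth: (brightness, resonance), both in [0, 1].

    Returns the amplitudes of the first N_HARMONICS harmonics of a sawtooth
    passed through a crude low-pass with a resonant bump at the cutoff.
    """
    brightness, resonance = params
    cutoff = 1.0 + brightness * (N_HARMONICS - 1)  # cutoff in harmonic numbers
    amps = []
    for h in range(1, N_HARMONICS + 1):
        saw = 1.0 / h  # sawtooth harmonic series
        lowpass = 1.0 / math.sqrt(1.0 + (h / cutoff) ** 4)
        peak = 1.0 + 2.0 * resonance * math.exp(-((h - cutoff) ** 2))
        amps.append(saw * lowpass * peak)
    return amps


def centroid(amps):
    """Amplitude-weighted mean harmonic number."""
    return sum((i + 1) * a for i, a in enumerate(amps)) / sum(amps)


def composite_loss(candidate, target):
    """Simplified composite loss: mean log-amplitude error plus centroid error."""
    log_spec = sum(
        abs(math.log(c + 1e-9) - math.log(t + 1e-9))
        for c, t in zip(candidate, target)
    ) / len(target)
    return log_spec + 0.5 * abs(centroid(candidate) - centroid(target))


def evolve(target, generations=60, popsize=16, elite=4, sigma=0.3, seed=0):
    """Derivative-free search: sample around a mean, keep the best, shrink sigma."""
    rng = random.Random(seed)
    mean = [0.5, 0.5]
    best_params = list(mean)
    best_loss = composite_loss(toy_synth(mean), target)
    for _ in range(generations):
        pop = []
        for _ in range(popsize):
            # Sample a candidate and clamp it to the valid parameter range.
            cand = [min(1.0, max(0.0, m + rng.gauss(0.0, sigma))) for m in mean]
            pop.append((composite_loss(toy_synth(cand), target), cand))
        pop.sort(key=lambda p: p[0])
        if pop[0][0] < best_loss:
            best_loss, best_params = pop[0]
        # Recombine: the new mean is the average of the elite candidates.
        top = [cand for _, cand in pop[:elite]]
        mean = [sum(c[i] for c in top) / elite for i in range(2)]
        sigma *= 0.93  # fixed step-size decay (CMA-ES adapts this instead)
    return best_params, best_loss


if __name__ == "__main__":
    target = toy_synth([0.3, 0.7])  # pretend this spectrum came from audio
    params, loss = evolve(target)
    print("recovered parameters:", params, "loss:", loss)
```

In this toy setting the target is produced by the same synthesizer, so a zero-loss solution exists; on real recorded audio the synthesizer architecture generally cannot reproduce the target exactly, which is why a nonzero matching loss remains.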

Paper Structure (12 sections, 2 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Pipeline: audio $\to$ Demucs separation $\to$ pitch detection $\to$ CMA-ES optimization loop $\to$ output parameters.
  • Figure 2: CMA-ES convergence. 90% of the improvement occurs in the first 10K evaluations (shaded). The remaining 90K evaluations yield a loss reduction of only 0.07.
  • Figure 3: Harmonic amplitudes: original vs. matched. H1--H3 are closely reproduced; H4--H8 diverge by 2--5$\times$, revealing the architectural floor of subtractive synthesis.
  • Figure 4: Landing page. Users drop an audio file (MP3 or WAV) and select Single Sound or Sequence mode.
  • Figure 5: Optimization in progress. The progress bar, evaluation count, current best loss, and elapsed time update live via WebSocket as CMA-ES runs across 7 CPU cores.
  • ...and 1 more figures