CONMOD: Controllable Neural Frame-based Modulation Effects

Gyubin Lee, Hounsu Kim, Junwon Lee, Juhan Nam

TL;DR

CONMOD addresses the lack of controllability in neural modelling of LFO-driven audio effects by predicting a frame-wise transfer function conditioned on LFO frequency and feedback. The approach combines an LSTM on the LFO with an MLP and FiLM conditioning to enable continuous control and to learn a shared embedding space for multiple phaser effects, including steerability between two distinct phasers. Through a multi-LFO training regime and a chirp-based training protocol, CONMOD achieves superior accuracy over a prior baseline and demonstrates robustness to unseen control settings and long audio sequences. This work enables flexible, creative neural emulations of LFO-based modulation with potential for broader universal modelling of time-varying audio effects.
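
As a rough illustration of the FiLM conditioning mentioned above, the sketch below applies a per-channel scale and shift, both computed from a control vector, to frame-wise features. The function name `film`, the weight names, the feature dimensions, and the two-element control vector (standing in for LFO frequency and feedback) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(condition) * features + beta(condition).

    features:  (n_frames, n_channels) hidden activations
    condition: (cond_dim,) control vector (hypothetical: [lfo_frequency, feedback])
    w_gamma, w_beta: (cond_dim, n_channels) projection weights
    b_gamma, b_beta: (n_channels,) projection biases
    """
    gamma = condition @ w_gamma + b_gamma  # per-channel scale
    beta = condition @ w_beta + b_beta     # per-channel shift
    # Broadcasting applies the same scale and shift to every frame.
    return gamma * features + beta


# Hypothetical usage: 4 frames, 3 channels, 2 control values.
features = np.arange(12.0).reshape(4, 3)
condition = np.array([0.5, 0.3])
rng = np.random.default_rng(0)
out = film(
    features,
    condition,
    w_gamma=rng.normal(size=(2, 3)),
    b_gamma=np.ones(3),
    w_beta=rng.normal(size=(2, 3)),
    b_beta=np.zeros(3),
)
```

Because the scale and shift depend only on the control vector, changing the controls changes how the features are modulated without changing the layers that produced them.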

Abstract

Deep learning models have seen widespread use in modelling LFO-driven audio effects, such as phaser and flanger. Although existing neural architectures exhibit high-quality emulation of individual effects, they do not possess the capability to manipulate the output via control parameters. To address this issue, we introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner, offering control over LFO frequency and feedback parameters. Additionally, the model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs. Our model outperforms previous work while possessing both controllability and universality, presenting opportunities to enhance creativity in modern LFO-driven audio effects.
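
To make "frame-wise" concrete, the following sketch applies one complex frequency response per frame and overlap-adds the results. The frame length, the periodic Hann window, the 50% hop, and the name `apply_framewise_tf` are assumptions chosen for illustration; the abstract does not specify them, and in CONMOD the responses would come from the network rather than being supplied by hand.

```python
import numpy as np


def apply_framewise_tf(x, H, frame_len=256):
    """Filter x with a different frequency response in each frame.

    x: 1-D input signal
    H: (n_frames, frame_len // 2 + 1) complex responses, one row per frame
    Returns the overlap-added output covering all n_frames frames.
    """
    hop = frame_len // 2  # assumed 50% overlap
    n_frames, n_bins = H.shape
    if n_bins != frame_len // 2 + 1:
        raise ValueError("H must have frame_len // 2 + 1 frequency bins")
    needed = hop * (n_frames - 1) + frame_len
    if len(x) < needed:
        raise ValueError("input is too short for the requested number of frames")

    # Periodic Hann window: overlapping copies at 50% hop sum to one.
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)

    y = np.zeros(needed)
    for m in range(n_frames):
        start = m * hop
        frame = x[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        # Multiply by this frame's response, then return to the time domain.
        y[start:start + frame_len] += np.fft.irfft(spectrum * H[m], n=frame_len)
    return y


# Hypothetical usage: an all-pass (identity) response leaves the signal unchanged
# wherever two frames overlap.
rng = np.random.default_rng(0)
signal = rng.normal(size=1152)
identity = np.ones((8, 129), dtype=complex)
filtered = apply_framewise_tf(signal, identity)
```

This simple version multiplies spectra directly, so each frame is filtered by circular convolution; it is meant to show the data flow only.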

Paper Structure (25 sections, 4 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Overall model architecture. Blocks with both dotted and solid input lines receive different input data in the training and inference phases. Grey blocks have no trainable parameters but are differentiable, so gradients can backpropagate through them. $\mathbf{z_a}$, $\mathbf{z_b}$, and $\mathbf{c_{emb}}$ are trainable parameters.
  • Figure 2: Proposed training technique for multiple LFO frequency controls. For a random input-output pair with a specified LFO frequency setting, only the corresponding $z_a$, $z_b$ parameters are optimized. The figure depicts the case in which $z_a^2$, $z_b^2$ are optimized. A stop-gradient operation keeps the other parameters from being updated.
  • Figure 3: ESR (%) results for the model trained on the digital phaser under seen and unseen LFO frequencies and Feedback parameters. Seen cases are shown in black and unseen cases in orange.
  • Figure 4: ESR (%) results for the model trained on the analog phaser under seen LFO frequencies and Color parameters.
  • Figure 5: ESR (%) results for the model trained on the analog phaser under seen and unseen LFO frequencies and Color parameters. Seen cases are shown in black and unseen cases in orange. Top: Color is varied at a fixed LFO frequency. Bottom: LFO frequency is varied at a fixed Color rate.
  • ...and 6 more figures
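
The ESR (%) values reported in the figures above refer to the error-to-signal ratio. A minimal sketch of its usual definition, the energy of the error divided by the energy of the target and expressed as a percentage, is given below. The function name `esr_percent` and the `eps` guard are assumptions, and any pre-filtering the authors may apply before computing the metric is not reproduced here.

```python
import numpy as np


def esr_percent(target, estimate, eps=1e-12):
    """Error-to-signal ratio in percent: 100 * sum((y - y_hat)^2) / sum(y^2)."""
    target = np.asarray(target, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if target.shape != estimate.shape:
        raise ValueError("target and estimate must have the same shape")
    error_energy = np.sum((target - estimate) ** 2)
    signal_energy = np.sum(target ** 2)
    # eps avoids division by zero for an all-zero target.
    return 100.0 * error_energy / (signal_energy + eps)


# Hypothetical usage: an estimate that is 10% too quiet everywhere.
t = np.arange(1000) / 1000.0
reference = np.sin(2.0 * np.pi * 5.0 * t)
score = esr_percent(reference, 0.9 * reference)
```

Lower is better: a perfect emulation scores 0, while an all-zero output scores 100.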