Hyper Recurrent Neural Network: Condition Mechanisms for Black-box Audio Effect Modeling

Yen-Tung Yeh, Wen-Yi Hsiao, Yi-Hsuan Yang

TL;DR

Conventional RNN-based virtual-analog modeling often conditions on knobs via simple concatenation, which limits expressive capacity. This work proposes three hypernetwork-based conditioning schemes—FiLM-RNN, StaticHyper-RNN, and DynamicHyper-RNN—to adapt model behavior to control parameters, and introduces a transient reconstruction metric to evaluate short-lived events. Across two devices (LA-2A and OD-3) and several objective metrics, all three methods outperform concatenation, with DynamicHyper-RNN delivering the strongest gains at higher computational cost while StaticHyper-RNN offers substantial compute savings. The study advances black-box audio effect emulation by integrating time-varying, parameter-conditioned weight modulation, and provides open data and code for reproducibility.

Abstract

Recurrent neural networks (RNNs) have demonstrated impressive results for virtual analog modeling of audio effects. These networks process time-domain audio signals using a series of matrix multiplications and nonlinear activation functions to emulate the behavior of the target device accurately. To additionally model the effect of the knobs in an RNN-based model, existing approaches integrate control parameters by concatenating them channel-wise with some intermediate representation of the input signal. While this method is parameter-efficient, there is room to further improve the quality of the generated audio, because concatenation-based conditioning has limited capacity to modulate signals. In this paper, we propose three novel conditioning mechanisms for RNNs, tailored for black-box virtual analog modeling. These conditioning mechanisms modulate the model based on control parameters, yielding superior results to existing RNN- and CNN-based architectures across various evaluation metrics.

Paper Structure

This paper contains 22 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The architecture of the FiLM-RNN, with $\phi$ representing the conditioning vector, $h$ representing the hidden state, and $x$ denoting the input signal. The FiLM-ed generator aims to produce scaling and shifting coefficients for feature-wise linear modulation of the feature maps.
  • Figure 2: The architecture of the StaticHyper-RNN, with $\phi$ representing the conditioning vector, $h$ representing the hidden state, and $x$ denoting the input signal. The MLP aims to generate the weight matrices $W_h$ and $W_x$ used in the matrix multiplications.
  • Figure 3: The architecture of the DynamicHyper-RNN, with $\phi$ representing the conditioning vector, $h$ representing the hidden state of the mainRNN, $x$ denoting the input signal, and $\hat{h}$ representing the hidden state of the hyperRNN. The hyperRNN generates the feature $Z_{o}$, then learns an additional transformation to modulate the feature map generated from the inputs $h$ and $x$.
  • Figure 4: The diagram illustrates the proposed transient metric. The blue color represents the signal in the time domain, while the orange color signifies the signal in the discrete cosine transform (DCT) domain. The algorithm extracts the transient signal and calculates the spectral loss in the DCT domain.
  • Figure 5: The diagram illustrates the spectrum difference observed in the Boss OD-3 test clips. All the proposed conditioning methods yield superior results compared to the Concatenation method.
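
The captions for Figures 1 and 2 describe two ways the conditioning vector $\phi$ can act on an RNN: by scaling and shifting a feature map (FiLM), or by generating the weight matrices $W_x$ and $W_h$ themselves (StaticHyper). The following is a minimal sketch of both ideas, not the authors' implementation. It assumes a vanilla tanh RNN cell and a single linear layer as each generator, whereas the caption for Figure 2 mentions an MLP and this excerpt does not state the cell type. All function names, parameter names (`G_gamma`, `G_beta`, `H_x`, `H_h`), sizes, and knob values are hypothetical, and random numbers stand in for learned weights.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.5):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def rnn_features(W_x, W_h, x, h):
    """Pre-activation feature map W_x x + W_h h of a vanilla RNN cell."""
    return [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]


def film_rnn_step(x, h, phi, p):
    """FiLM-style step: phi yields a per-feature scale and shift."""
    feat = rnn_features(p["W_x"], p["W_h"], x, h)
    gamma = matvec(p["G_gamma"], phi)  # scaling coefficients
    beta = matvec(p["G_beta"], phi)    # shifting coefficients
    return [math.tanh(g * f + b) for g, f, b in zip(gamma, feat, beta)]


def static_hyper_weights(phi, p, n_in, n_hidden):
    """StaticHyper-style generator: phi yields W_x and W_h themselves."""
    flat_x = matvec(p["H_x"], phi)
    flat_h = matvec(p["H_h"], phi)
    # Reshape the flat outputs into (n_hidden x n_in) and (n_hidden x n_hidden).
    W_x = [flat_x[i * n_in:(i + 1) * n_in] for i in range(n_hidden)]
    W_h = [flat_h[i * n_hidden:(i + 1) * n_hidden] for i in range(n_hidden)]
    return W_x, W_h


def static_hyper_rnn_step(x, h, W_x, W_h):
    """Plain RNN step using the generated weights."""
    return [math.tanh(f) for f in rnn_features(W_x, W_h, x, h)]


rng = random.Random(0)
n_in, n_hidden, n_cond = 1, 4, 2  # hypothetical sizes
params = {
    "W_x": rand_matrix(n_hidden, n_in, rng),
    "W_h": rand_matrix(n_hidden, n_hidden, rng),
    "G_gamma": rand_matrix(n_hidden, n_cond, rng),
    "G_beta": rand_matrix(n_hidden, n_cond, rng),
    "H_x": rand_matrix(n_hidden * n_in, n_cond, rng),
    "H_h": rand_matrix(n_hidden * n_hidden, n_cond, rng),
}

phi = [0.3, 0.8]                # hypothetical knob settings
signal = [0.0, 0.5, -0.2, 0.1]  # a few input samples

# In this sketch the generated weights depend only on phi,
# so they are computed once before the sample loop.
W_x_gen, W_h_gen = static_hyper_weights(phi, params, n_in, n_hidden)

h_film = [0.0] * n_hidden
h_static = [0.0] * n_hidden
for sample in signal:
    h_film = film_rnn_step([sample], h_film, phi, params)
    h_static = static_hyper_rnn_step([sample], h_static, W_x_gen, W_h_gen)
```

In this sketch the FiLM path recomputes its modulation inside every step, while the StaticHyper path generates its weights once per knob setting and then runs as an ordinary RNN. This is one plausible reading of why the TL;DR credits StaticHyper-RNN with compute savings, though the excerpt does not spell out the mechanism.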