Modeling Time-Variant Responses of Optical Compressors with Selective State Space Models

Riccardo Simionato, Stefano Fasciani

TL;DR

This work tackles the challenge of accurately emulating optical dynamic range compressors with a neural, low-latency approach. It introduces Selective State Space (S6) networks augmented with FiLM and TemporalFiLM conditioning to capture both the magnitude-driven compression and the device-specific timing dynamics, enabling per-sample output with minimal latency ($y_n = g x_n$, using a 64-sample input window). Across hardware (LA-2A, TubeTech CL 1B) and software emulations, the S6-based architecture outperforms LSTM, ED, S4D, and TCN baselines on multiple objective metrics and perceptual tests, especially for time-variant behaviors. The results underscore the model’s ability to generalize to unseen parameter settings and highlight the nuanced impact of control-parameter sampling density on interpolation accuracy. Overall, the method offers a practical path to real-time, high-fidelity emulation of analog optical dynamics with potential applicability to other time-variant audio effects.
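The per-sample scheme summarized above ($y_n = g x_n$, with $g$ predicted from a 64-sample window) can be sketched as follows. This is only an illustration of the windowing and gain application, not the authors' implementation: `frame_signal`, `apply_model`, and `toy_gain` are hypothetical names, the zero padding at the start of the signal is an assumption, and `toy_gain` is a trivial stand-in for the trained network.

```python
import numpy as np

WINDOW = 64  # input window length stated in the text


def frame_signal(x, window=WINDOW):
    """Return one row per sample: the current sample preceded by window-1 past samples."""
    # Assumption: the signal is zero-padded so the first samples also get a full window.
    padded = np.concatenate([np.zeros(window - 1, dtype=x.dtype), x])
    return np.lib.stride_tricks.sliding_window_view(padded, window)


def apply_model(x, predict_gain, window=WINDOW):
    """Compute y_n = g_n * x_n, where g_n is predicted from the window ending at x_n."""
    frames = frame_signal(x, window)                  # shape (len(x), window)
    g = np.array([predict_gain(f) for f in frames])   # one gain coefficient per sample
    return g * x


def toy_gain(frame):
    """Hypothetical stand-in for the network: lower gain as the window RMS rises."""
    rms = np.sqrt(np.mean(frame ** 2))
    return 1.0 / (1.0 + rms)


# Arbitrary test tone; the 48 kHz sample rate is an assumption for this example only.
x = np.sin(2 * np.pi * 440 * np.arange(480) / 48000).astype(np.float32)
frames = frame_signal(x)
y = apply_model(x, toy_gain)
```

Because each output sample depends only on the current and past input samples, a scheme of this shape adds no look-ahead delay, which is consistent with the low-latency claim in the summary.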

Abstract

This paper presents a method for modeling optical dynamic range compressors using deep neural networks with Selective State Space models. The proposed approach surpasses previous methods based on recurrent layers by employing a Selective State Space block to encode the input audio. It features a refined technique integrating Feature-wise Linear Modulation and Gated Linear Units to adjust the network dynamically, conditioning the compression's attack and release phases according to external parameters. The proposed architecture is well-suited for low-latency and real-time applications, which are crucial in live audio processing. The method has been validated on the analog optical compressors TubeTech CL 1B and Teletronix LA-2A, which possess distinct characteristics. Evaluation is performed using quantitative metrics and subjective listening tests, comparing the proposed method with other state-of-the-art models. Results show that our black-box modeling methods outperform all the compared models, achieving accurate emulation of the compression process for settings both seen and unseen during training. We further show a correlation between this accuracy and the sampling density of the control parameters in the dataset and identify settings with fast attack and slow release as the most challenging to emulate.

Paper Structure (15 sections, 10 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Proposed architecture: the input $\boldsymbol{x}$, which contains the current and past samples, is fed to a linear FC layer and subsequently to an S6 block (detailed in Figure 2). The resulting vector passes through the conditioning layer and another identical S6 block before reaching the output layer, which is a linear FC layer with one unit. The number of units (u) is indicated next to each layer. The output is a coefficient $g$ which, multiplied by the current input sample $x_n$, yields the current output sample $y_n$.
  • Figure 2: Internal architecture of the S6 block featured in the architecture shown in Figure 1. The first FC layer has twice as many units as the input dimension and thus creates a double projection of its input. The projection is split into two equal-sized vectors: one passes through the convolutional layer, then the swish function, and finally the S6 layer; the other, after the swish function, is element-wise multiplied by the output of the S6 layer. The result from the residual connection feeds an FC layer with a number of units equal to the length of the block input vector. The number of units (u), the activation functions, and the kernel size (k), where applicable, are reported next to the layers.
  • Figure 3: Internal architecture of the conditioning block featured in the architecture shown in Figure 1. The device control parameters, which condition the model, are split into two vectors: those influencing the amount of compression, $\boldsymbol{p_{co}}$, and those determining the timing behavior, $\boldsymbol{p_{ti}}$. The FiLM method, followed by a GLU featuring the softsign function, is used for $\boldsymbol{p_{co}}$. Similarly, the temporal FiLM method, followed by a GLU featuring the softsign function, is used for $\boldsymbol{p_{ti}}$. In both cases, the vector $\boldsymbol{f}$, computed by a convolutional layer that takes the magnitude spectrum of the input $\boldsymbol{x}$, is concatenated to $\boldsymbol{p_{co}}$ and $\boldsymbol{p_{ti}}$. In the case of the LA-2A, $\boldsymbol{p_{co}}$ corresponds to the peak reduction, while $\boldsymbol{p_{ti}}$ pertains to the switch mode. For the CL 1B, $\boldsymbol{p_{co}}$ includes both the threshold and ratio, whereas $\boldsymbol{p_{ti}}$ encompasses the attack and release times. The number of units (u) and the kernel size (k), where applicable, are reported next to the layers.
  • Figure 4: Recurrent blocks used as alternatives to the S6 block in the architecture of Figure 1 to compare performance. From left to right: S4D, ED, and LSTM blocks. In the ED block, the convolutional layer is included only before the conditioning block. The number of units (u), the activation functions, and the kernel size (k), where applicable, are reported next to the layers.
  • Figure 5: TubeTech CL 1B (top) and Teletronix LA-2A (bottom) optical dynamic range compressors used to collect the dataset for the experiments.
  • ...and 6 more figures
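
The FiLM-plus-GLU conditioning path described in the Figure 3 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the layer sizes, the random untrained weights, the exact ordering of projections, and the names `linear` and `make_film_glu` are all hypothetical. Only the general idea comes from the caption: a conditioning vector (control parameters concatenated with a feature vector $\boldsymbol{f}$) produces a scale and shift for the hidden vector, followed by a gated unit that uses softsign instead of the usual sigmoid. The temporal FiLM branch used for $\boldsymbol{p_{ti}}$ is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)


def softsign(z):
    """Softsign activation: z / (1 + |z|), bounded in (-1, 1)."""
    return z / (1.0 + np.abs(z))


def linear(in_dim, out_dim):
    """Randomly initialised dense layer (untrained, for shape illustration only)."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda v: v @ W + b


def make_film_glu(cond_dim, units):
    """Build a FiLM layer followed by a GLU gated with softsign (assumed layout)."""
    to_gamma_beta = linear(cond_dim, 2 * units)  # conditioning -> scale and shift
    to_gate = linear(units, 2 * units)           # modulated vector -> value and gate

    def block(h, cond):
        gamma, beta = np.split(to_gamma_beta(cond), 2)
        modulated = gamma * h + beta             # FiLM: feature-wise affine modulation
        a, b = np.split(to_gate(modulated), 2)
        return a * softsign(b)                   # GLU with softsign as the gate

    return block


units = 8                        # hypothetical hidden size
p_co = np.array([0.5, 0.25])     # hypothetical normalised threshold and ratio
f = rng.normal(size=4)           # hypothetical feature vector from the input spectrum
cond = np.concatenate([p_co, f])

block = make_film_glu(cond.size, units)
h = rng.normal(size=units)       # hypothetical hidden vector from the first S6 block
out = block(h, cond)
```

Since softsign is bounded in (-1, 1), the gate can both attenuate and invert the sign of each feature, unlike a sigmoid gate, which only attenuates.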