Empirical Results for Adjusting Truncated Backpropagation Through Time while Training Neural Audio Effects

Yann Bourdin, Pierrick Legrand, Fanny Roche

TL;DR

This paper empirically analyzes how truncated backpropagation through time (TBPTT) hyperparameters—number of sub-sequences N, batch size B, and sub-sequence length L—affect training dynamics and performance in a convolutional‑recurrent neural model for dynamic range compression (DRC). Using the SPTMod architecture with a State Prediction Network (SPN) and FiLM/TFiLM conditioning, the authors show that careful TBPTT tuning improves accuracy and stability and can reduce compute, validated by objective losses and a perceptual listening test. Across snapshot and full-modeling datasets, larger Lc and B generally boost accuracy and reduce variance, though results depend on data diversity and model configuration; TBPTT enables larger batch sizes and faster convergence. Subjective evaluations confirm that perceptual quality remains high under optimized TBPTT settings, suggesting practical benefits for real-time neural audio effects and informing future multi‑objective hyperparameter optimization and implementation in performant languages like C++.

Abstract

This paper investigates the optimization of Truncated Backpropagation Through Time (TBPTT) for training neural networks in digital audio effect modeling, with a focus on dynamic range compression. The study evaluates key TBPTT hyperparameters -- sequence number, batch size, and sequence length -- and their influence on model performance. Using a convolutional-recurrent architecture, we conduct extensive experiments across datasets with and without conditioning by user controls. Results demonstrate that carefully tuning these parameters enhances model accuracy and training stability, while also reducing computational demands. Objective evaluations confirm improved performance with optimized settings, while subjective listening tests indicate that the revised TBPTT configuration maintains high perceptual quality.

Paper Structure

This paper contains 20 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: State prediction network (SPN) and SPTMod
  • Figure 2: State Prediction Block (SPN Block, on the left side), and Modulation Block (ModBlock, on the right side).
  • Figure 3: Diagram of intermediary tensor lengths for consecutive (non-overlapping) sequence batches in our TBPTT-based approach with $N=3$. In the first iteration, no padding is applied, so the input length includes the samples needed for temporal operations. In subsequent iterations, states and caches are retained, but their gradients are detached from the computational graph.
  • Figure 4: Loss after training the 6 models on 10 splits each, evaluated on their respective test subsets, depicted by markers.
  • Figure 5: TCN variant in the listening test
  • ...and 2 more figures
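
The Figure 3 caption describes how input lengths differ between the first and later TBPTT iterations. The sketch below is one possible reading of that description, not the authors' implementation: it computes which input samples each of the N consecutive sub-sequences would read. The function name `tbptt_slices` and the parameter `warmup` (the number of extra past samples the temporal operations need) are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def tbptt_slices(num_subsequences: int, subseq_len: int, warmup: int) -> List[Tuple[int, int]]:
    """Return (start, stop) input sample ranges for N consecutive sub-sequences.

    `warmup` is an assumed count of extra past samples needed by the
    temporal operations (convolutions, recurrent context). It is not a
    value taken from the paper.
    """
    if num_subsequences < 1 or subseq_len < 1 or warmup < 0:
        raise ValueError("need num_subsequences >= 1, subseq_len >= 1, warmup >= 0")

    slices = []
    # First iteration: no padding is applied, so the input must also
    # contain the warm-up samples consumed by the temporal operations.
    stop = warmup + subseq_len
    slices.append((0, stop))

    # Later iterations: states and caches retained from the previous
    # iteration supply the context, so only new samples are read.
    # (In an autograd framework those retained values would be detached
    # from the computational graph before reuse.)
    for _ in range(1, num_subsequences):
        start = stop
        stop = start + subseq_len
        slices.append((start, stop))
    return slices


plan = tbptt_slices(num_subsequences=3, subseq_len=4, warmup=2)
print(plan)  # [(0, 6), (6, 10), (10, 14)]
```

With these illustrative values, the first iteration reads 6 samples and the next two read 4 each. Under this reading, every iteration yields the same number of output samples and no input sample is processed twice.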