Sines, Transient, Noise Neural Modeling of Piano Notes

Riccardo Simionato, Stefano Fasciani

TL;DR

A novel method for emulating piano sounds that exploits the sines, transient, and noise decomposition to design a differentiable spectral modeling synthesizer replicating piano notes. The method achieves perceptual accuracy in emulating single notes and trichords.

Abstract

This paper introduces a novel method for emulating piano sounds. We propose to exploit the sines, transient, and noise decomposition to design a differentiable spectral modeling synthesizer replicating piano notes. Three sub-modules learn these components from piano recordings and generate the corresponding harmonic, transient, and noise signals. Splitting the emulation into three independently trainable models reduces the complexity of the modeling tasks. The quasi-harmonic content is produced using a differentiable sinusoidal model guided by physics-derived formulas, whose parameters are automatically estimated from audio recordings. The noise sub-module uses a learnable time-varying filter, and the transients are generated using a deep convolutional network. Starting from single notes, we emulate the coupling between different keys in trichords with a convolution-based network. Results show that the model matches the partial distribution of the target, while predicting the energy in the higher part of the spectrum presents more challenges. The energy distribution in the spectra of the transient and noise components is accurate overall. While the model is more computationally and memory efficient, perceptual tests reveal limitations in accurately modeling the attack phase of notes. Despite this, it generally achieves perceptual accuracy in emulating single notes and trichords.
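
As a rough illustration of what a quasi-harmonic sinusoidal model computes, the sketch below sums exponentially decaying sine partials whose frequencies follow the textbook stiff-string relation $f_m = m f_0 \sqrt{1 + B m^2}$. This relation, the function names, and the amplitude and decay values are assumptions chosen for illustration. The abstract only says the model is "guided by physics-derived formulas", so this is not a reproduction of the authors' layers, and nothing here is learned from recordings.

```python
import math


def partial_frequencies(f0, B, num_partials):
    """Stiff-string partials: f_m = m * f0 * sqrt(1 + B * m^2) (assumed formula)."""
    return [m * f0 * math.sqrt(1.0 + B * m * m) for m in range(1, num_partials + 1)]


def synthesize_quasi_harmonic(f0, B, amplitudes, decay_rates,
                              sample_rate=16000, duration=0.5):
    """Sum exponentially decaying sine partials into one mono signal."""
    freqs = partial_frequencies(f0, B, len(amplitudes))
    nyquist = sample_rate / 2.0
    num_samples = int(sample_rate * duration)
    signal = []
    for n in range(num_samples):
        t = n / sample_rate
        sample = 0.0
        for f, a, d in zip(freqs, amplitudes, decay_rates):
            if f < nyquist:  # skip partials that would alias
                sample += a * math.exp(-d * t) * math.sin(2.0 * math.pi * f * t)
        signal.append(sample)
    return signal


# Hypothetical parameter values, not taken from the paper.
num_partials = 8
freqs = partial_frequencies(220.0, 1e-4, num_partials)
amplitudes = [1.0 / m for m in range(1, num_partials + 1)]
decay_rates = [2.0 * m for m in range(1, num_partials + 1)]
note = synthesize_quasi_harmonic(220.0, 1e-4, amplitudes, decay_rates)
```

With $B > 0$ every partial sits slightly above its harmonic position $m f_0$, and the offset grows with $m$. A trainable version would treat $B$ and the decay rates as learnable parameters instead of constants.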

Paper Structure (18 sections, 18 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: The Quasi-Harmonic Model consists of $3$ layers: the Inharmonicity layer (a), the Damping layer (b), and the Sine Generator layer (c). The Inharmonicity layer computes the partial distribution for vertical and horizontal polarizations, the Damping layer predicts the partial amplitudes, while the Sine Generator layer computes and sums together all the sine components.
  • Figure 2: The layers composing the Quasi-Harmonic Model. The Inharmonicity layer (left) takes the inharmonicity factor $B$, which is a learnable parameter, and computes the partial distribution for the vertical polarization $f^v_m$ from the input frequency. At the same time, the input velocity is fed to a feedforward network, which is used to detune the input frequency and predict the partial distribution for the horizontal polarization $f^h_m$. The Damping layer (right) predicts the damping coefficients that govern the decay of the partials for both polarizations. In this case, the inputs are the velocity and an index that indicates how many inference iterations have passed.
  • Figure 3: The transient model takes the velocity of the note as input and, using a stack of upsampling and convolutional layers, generates the waveform. An inverse discrete cosine transform is then applied to the resulting waveform to compute $y_{trans}$.
  • Figure 4: The noise is modeled by generating noise filter magnitudes $\boldsymbol{\eta}$, which are convolved in the frequency domain with generated white noise. The input vector for all the layers consists of the velocity $v_n$ and the time index $i_n$.
  • Figure 5: The coupling among different keys is modeled by a convolutional neural network, with the temporal FiLM method combined with the GLU to condition the network on the keys being played. The input vector is the sum of the three separate note sounds, while their frequencies, velocities, and time index compose the conditioning vector.
  • ...and 8 more figures
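
The noise path described in Figure 4 can be sketched in a simplified form. The example below assumes one common reading of filtered-noise synthesis: for each frame, the spectrum of a white-noise frame is multiplied bin by bin by a vector of non-negative filter magnitudes and then transformed back to the time domain. The function names, the frame size, and the fixed magnitude curve are hypothetical. In the described model the magnitudes $\boldsymbol{\eta}$ are predicted by a network from the velocity and time index, which is not reproduced here.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (fine for short frames)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
            for b in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[b] * cmath.exp(2j * math.pi * b * k / n) for b in range(n)) / n
            for k in range(n)]


def shape_noise(noise, magnitudes):
    """Scale each frequency bin of a noise frame by a real, non-negative gain.

    `magnitudes` holds len(noise) // 2 + 1 gains for bins 0..Nyquist; they are
    mirrored onto the negative-frequency bins so the output stays real.
    Assumes an even frame length.
    """
    n = len(noise)
    assert n % 2 == 0 and len(magnitudes) == n // 2 + 1
    spectrum = dft(noise)
    shaped = []
    for b in range(n):
        mirrored = b if b <= n // 2 else n - b
        shaped.append(spectrum[b] * magnitudes[mirrored])
    return [value.real for value in idft(shaped)]


# Hypothetical frame: 64 samples of white noise and a low-pass-like gain curve.
rng = random.Random(0)
frame_size = 64
white = [rng.uniform(-1.0, 1.0) for _ in range(frame_size)]
eta = [1.0 / (1.0 + b) for b in range(frame_size // 2 + 1)]
colored = shape_noise(white, eta)
```

A time-varying filter follows by using a different magnitude vector for each successive frame and overlap-adding the shaped frames.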