Differentiable Time-Varying Linear Prediction in the Context of End-to-End Analysis-by-Synthesis

Chin-Yun Yu; György Fazekas

TL;DR

The improved GOLF vocoder learns LP coefficients and reconstructs the voice better than its frame-wise counterparts, and in a listening test its synthesised outputs scored higher in quality ratings than the state-of-the-art differentiable WORLD vocoder.

Abstract

Training the linear prediction (LP) operator end-to-end for audio synthesis in modern deep learning frameworks is slow due to its recursive formulation. In addition, frame-wise approximation as an acceleration method cannot generalise well to test-time conditions where the LP is computed sample-wise. Efficient differentiable sample-wise LP for end-to-end training is the key to removing this barrier. We generalise the efficient time-invariant LP implementation from the GOLF vocoder to time-varying cases. Combining this with the classic source-filter model, we show that the improved GOLF learns LP coefficients and reconstructs the voice better than its frame-wise counterparts. Moreover, in our listening test, synthesised outputs from GOLF scored higher in quality ratings than the state-of-the-art differentiable WORLD vocoder.
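
To make the "sample-wise" recursion in the abstract concrete, the following is a minimal NumPy sketch of time-varying all-pole (LP synthesis) filtering. It is a naive reference loop, not the authors' efficient differentiable implementation. The sign convention y(t) = e(t) - sum_i a_i(t) y(t-i), the function names `time_varying_lp` and `hold_per_frame`, and the toy values are assumptions made for illustration only.

```python
import numpy as np


def time_varying_lp(e, a):
    """Sample-wise all-pole filtering with per-sample coefficients.

    e: excitation signal, shape (T,)
    a: LP coefficients, shape (T, M); a[t, i - 1] multiplies y[t - i]
    Assumed convention: y[t] = e[t] - sum_i a[t, i - 1] * y[t - i],
    with zero initial conditions.
    """
    T, M = a.shape
    y = np.zeros(T)
    for t in range(T):
        acc = e[t]
        # Only past outputs that exist (t - i >= 0) contribute.
        for i in range(1, min(M, t) + 1):
            acc -= a[t, i - 1] * y[t - i]
        y[t] = acc
    return y


def hold_per_frame(a_frames, hop, T):
    """Upsample frame-rate coefficients to sample rate by repeating
    each frame's coefficients for `hop` samples (zero-order hold)."""
    return np.repeat(a_frames, hop, axis=0)[:T]


# Unit impulse as excitation.
e = np.array([1.0, 0.0, 0.0, 0.0])

# Time-invariant case: a single pole at 0.5 gives y[t] = 0.5 ** t.
a_const = np.full((4, 1), -0.5)
y_const = time_varying_lp(e, a_const)  # [1.0, 0.5, 0.25, 0.125]

# Time-varying case: the coefficient changes after two samples.
a_vary = hold_per_frame(np.array([[-0.5], [-0.25]]), hop=2, T=4)
y_vary = time_varying_lp(e, a_vary)  # [1.0, 0.5, 0.125, 0.03125]
```

The explicit loop over t is what makes this recursion slow inside an automatic differentiation framework, since each output sample depends on the previous ones.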

Paper Structure (19 sections, 10 equations, 3 figures, 1 table)

Introduction
Background
Harmonic-plus-noise GOLF
Differentiable time-invariant LP
Methodology
Source-filter GOLF
Differentiable time-varying LP
The gradients to $e(t)$
The gradients to $\tilde{\mathbf{a}}(t)$
Experiment
Dataset and training configurations
Speaker-independent encoder
Decoders
Evaluations and discussions
Objective evaluations
...and 4 more sections

Figures (3)

Figure 1: Flow diagram of the proposed experiment. For DDSP, the $G(z)H(z)$ is jointly modelled using an additive synthesiser.
Figure 2: The running spectra converted from the encoded LPCs using 0.4 seconds of speech from speaker p360. The rightmost LPCs are computed using the auto-correlation method from SPTK with the same filter order as GOLFs.
Figure 3: The average ratings of each speaker with a 95% confidence interval.
