AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Samir Sadok, Simon Leglaive, Laurent Girin, Gaël Richard, Xavier Alameda-Pineda

TL;DR

AnCoGen is a bidirectional masked autoencoder that maps between Mel-spectrograms (MS) and high-level speech attributes (SA), supporting analysis, manipulation, and synthesis within a single model. Through coupled masking of the MS and SA representations and a VQ-VAE–based tokenization, the approach learns inter- and intra-representation dependencies; paired with a HiFi-GAN vocoder, it supports robust pitch estimation, pitch transformation, and noise suppression. Empirical results show strong performance in analysis-resynthesis, accurate $f_0$ estimation under noisy and reverberant conditions, and competitive denoising quality, though speaker identity fidelity for unseen speakers remains challenging. The work advances practical speech editing and enhancement applications, and points to future directions such as finer or hierarchical speaker representations to improve generalization to unseen voices.
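The coupled masking mentioned above can be pictured with a toy sketch. It is not the authors' implementation: the token ids, the `MASK` placeholder, the `coupled_mask` helper, and the masking ratios are all hypothetical, and the reading that analysis corresponds to hiding the attribute tokens while generation corresponds to hiding the spectrogram tokens is an assumption based on the bidirectional mapping described here.

```python
import random

MASK = -1  # hypothetical id standing in for a learned mask token


def coupled_mask(ms_tokens, sa_tokens, ms_ratio, sa_ratio, seed=0):
    """Hide a fraction of each token sequence; return the masked copies."""
    rng = random.Random(seed)

    def mask(tokens, ratio):
        # Number of positions to hide, then a random choice of which ones.
        n_masked = round(ratio * len(tokens))
        hidden = set(rng.sample(range(len(tokens)), n_masked))
        return [MASK if i in hidden else t for i, t in enumerate(tokens)]

    return mask(ms_tokens, ms_ratio), mask(sa_tokens, sa_ratio)


ms = [12, 7, 33, 5, 18, 40]  # toy Mel-spectrogram (MS) token ids
sa = [3, 9, 1, 4]            # toy speech-attribute (SA) token ids

# Assumed "analysis" setting: MS fully visible, SA fully hidden.
analysis_ms, analysis_sa = coupled_mask(ms, sa, ms_ratio=0.0, sa_ratio=1.0)

# Assumed "generation" setting: SA fully visible, MS fully hidden.
generation_ms, generation_sa = coupled_mask(ms, sa, ms_ratio=1.0, sa_ratio=0.0)

# Assumed training-like setting: both sequences partially hidden.
train_ms, train_sa = coupled_mask(ms, sa, ms_ratio=0.5, sa_ratio=0.5)
```

In this picture, a single model trained to fill in the hidden tokens of either sequence would cover both directions, and editing the visible SA tokens before generation would correspond to control.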

Abstract

This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.

Paper Structure

This paper contains 10 sections, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Analysis, control, and generation of speech with AnCoGen.
  • Figure 2: Overall architecture of AnCoGen, read from left (input) to right (output).
  • Figure 3: $f_0$ estimation results.
  • Figure 4: Speech denoising results (best score in each column is in bold, second best score is underlined).