DGSNA: prompt-based Dynamic Generative Scene-based Noise Addition method

Zihao Chen, Zhentao Lin, Bi Zeng, Linyi Huang, Zhi Li, Jia Cai

TL;DR

DGSNA addresses the challenge of robustly simulating real-world noisy scenes for speech processing by integrating dynamic generation of scene-based information (DGSI) with scene-based noise addition for speech (SNAS) using a BET prompt framework and generative chat models. It combines TTA-based diffusion noise generation (AudioLDM/CLAP/VAE/HiFi-GAN) with RIR-based convolution to produce scene-consistent noisy speech, enabling automated and scalable augmentation. The approach yields up to 11.21% relative improvements in ASR and keyword spotting robustness across varied ANRs, and demonstrates compatibility with other augmentation methods, with generalizability across multiple generative chat models. This framework advances realistic data augmentation for diverse acoustic environments, reducing reliance on labor-intensive scene enumeration and noisy data collection, and enhances practical deployment of robust speech systems.
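As a rough illustration of the noise addition step summarized above, the following minimal sketch convolves clean speech with a room impulse response (RIR) and then mixes in a noise clip. It is not the authors' implementation: the function name `add_scene_noise`, the use of a signal-to-noise ratio in dB as the mixing level, and the tiling of short noise clips are all assumptions made for illustration, and the noise is assumed to be non-silent and already generated (for example by a text-to-audio model).

```python
import numpy as np

def add_scene_noise(speech, noise, rir, snr_db):
    """Hypothetical helper: reverberate speech with an RIR, then add noise.

    The mixing level is expressed here as an SNR in dB, which is an
    assumption of this sketch rather than the paper's definition.
    """
    # Apply the room response; keep the original number of samples.
    reverberant = np.convolve(speech, rir)[: len(speech)]

    # Repeat or trim the noise clip so it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that speech power / noise power matches snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    return reverberant + scale * noise

# Usage with synthetic signals standing in for real audio.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)          # 1 s of "speech" at 16 kHz
scene_noise = rng.standard_normal(4000)     # shorter noise clip
room = np.array([1.0, 0.0, 0.3, 0.0, 0.1])  # toy impulse response
noisy = add_scene_noise(clean, scene_noise, room, snr_db=10.0)
```

Truncating the convolution to the input length keeps the noisy utterance aligned with its transcript, at the cost of dropping the reverberation tail.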

Abstract

To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the prevailing solution. However, existing methods offer limited coverage of real-world noisy scenes and depend on pre-existing scene-based information and noise. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel noise addition methodology that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). This integration facilitates automated scene-based noise addition by transforming clean speech into various noise environments, thereby providing a more comprehensive and realistic simulation of diverse noise conditions. Experimental results demonstrate that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models across various noise conditions, achieving a relative improvement of up to 11.21%. Furthermore, DGSNA can be effectively integrated with other noise addition methods to enhance performance. Our implementation and demonstrations are available at https://dgsna.github.io.

Paper Structure

This paper contains 26 sections, 6 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Comparative analysis of noise addition methods.
  • Figure 2: Overall framework of the proposed DGSNA method. In this framework, the BET Prompt Framework module is structured into two forms (for details, see Section \ref{subsec5_3}). Additionally, the Generative Chat Model module incorporates a design pattern that integrates the logos of three different generative chat models, enhancing adaptability and functionality (for details, see Section \ref{subsec4_3}).
  • Figure 3: Examples of scene dynamic generation. Initially, the B (Background) component entails the user providing a clear and detailed description of the specified scene's design and context (red font). Subsequently, the E (Examples) component requires the user to input few-shot prompts, which include a text description of the target scene (blue font). Following this, the generative chat model generates scene-based information that aligns with the predefined task background (green font). Finally, the T (Task) component involves the user specifying requirements for the dynamic generation of the scene (purple font).
  • Figure 4: Overview of DGSNA data generation.
  • Figure 5: Results of the KWS experiment.
  • ...and 5 more figures
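
The B (Background), E (Examples), and T (Task) components described in the Figure 3 caption can be sketched as a simple prompt builder. This is only an illustrative assumption about how the three parts might be concatenated into one text prompt: the function name `build_bet_prompt`, the section labels, and the sample strings are hypothetical, and the paper's actual prompt wording and turn structure may differ.

```python
def build_bet_prompt(background, examples, task):
    """Hypothetical helper: join Background, Examples, and Task into one prompt."""
    lines = ["Background:", background.strip(), "", "Examples:"]
    # Number the few-shot scene descriptions so the model can tell them apart.
    lines += [f"{i}. {ex.strip()}" for i, ex in enumerate(examples, start=1)]
    lines += ["", "Task:", task.strip()]
    return "\n".join(lines)

# Illustrative inputs only; none of these strings come from the paper.
prompt = build_bet_prompt(
    background="You design acoustic scenes for augmenting clean speech.",
    examples=[
        "A busy cafe with clinking cups and quiet chatter.",
        "A subway platform as a train arrives.",
    ],
    task="Generate five new scene descriptions in the same style.",
)
```

The resulting string would then be sent to whichever generative chat model is in use; that call is left out here because it depends on the chosen provider.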