DGSNA: prompt-based Dynamic Generative Scene-based Noise Addition method

Zihao Chen; Zhentao Lin; Bi Zeng; Linyi Huang; Zhi Li; Jia Cai

TL;DR

DGSNA addresses the challenge of simulating real-world noisy scenes for speech processing. It integrates dynamic generation of scene-based information (DGSI) with scene-based noise addition for speech (SNAS), using a BET prompt framework and generative chat models. SNAS combines text-to-audio (TTA) diffusion noise generation (AudioLDM/CLAP/VAE/HiFi-GAN) with room impulse response (RIR) convolution to produce scene-consistent noisy speech, which makes augmentation automated and scalable. The approach yields relative improvements of up to 11.21% in ASR and keyword spotting robustness across varied ANRs, is compatible with other augmentation methods, and generalizes across multiple generative chat models. The framework reduces reliance on labor-intensive scene enumeration and noisy data collection, supporting more realistic data augmentation for diverse acoustic environments and the practical deployment of robust speech systems.
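
The noise addition step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`convolve`, `power`, `add_scene_noise`) are hypothetical, the generated noise and the RIR are assumed to be available already as lists of samples, and the mixing level is expressed here as a speech-to-noise power ratio in dB, which is an assumption about how the noise level is controlled.

```python
import math


def convolve(signal, rir):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_scene_noise(speech, noise, rir, ratio_db):
    """Reverberate clean speech with an RIR, then add scaled scene noise.

    ratio_db is the assumed target ratio of reverberant-speech power
    to noise power, in dB (hypothetical parameter, not from the source).
    """
    # Apply the room response and trim back to the original length.
    reverberant = convolve(speech, rir)[:len(speech)]

    # Loop the noise clip so it covers the whole utterance, then trim.
    repeats = -(-len(reverberant) // len(noise))  # ceiling division
    tiled = (list(noise) * repeats)[:len(reverberant)]

    noise_power = power(tiled)
    if noise_power == 0.0:
        raise ValueError("noise clip is silent; cannot scale it")

    # Choose the gain so that speech power / scaled noise power
    # equals 10 ** (ratio_db / 10).
    gain = math.sqrt(power(reverberant) / (noise_power * 10 ** (ratio_db / 10)))
    return [s + gain * n for s, n in zip(reverberant, tiled)]
```

In a real pipeline the nested-loop convolution would normally be replaced by an FFT-based routine from a signal processing library; the plain-Python version is used here only to keep the sketch dependency-free.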

Abstract

To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the prevailing solution. However, existing methods offer limited coverage of real-world noisy scenes and depend on pre-existing scene-based information and noise. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel noise addition methodology that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). This integration facilitates automated scene-based noise addition by transforming clean speech into various noise environments, thereby providing a more comprehensive and realistic simulation of diverse noise conditions. Experimental results demonstrate that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models across various noise conditions, achieving a relative improvement of up to 11.21%. Furthermore, DGSNA can be effectively integrated with other noise addition methods to enhance performance. Our implementation and demonstrations are available at https://dgsna.github.io.
