Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection

Ksheeraja Raghavan, Samiran Gode, Ankit Shah, Surabhi Raghavan, Wolfram Burgard, Bhiksha Raj, Rita Singh

TL;DR

This work tackles the scarcity of real-world, audio-only anomaly benchmarks by introducing AADG, a modular framework that leverages large language models as world models to synthesize plausible anomalous audio scenarios. The pipeline generates scenario descriptions, extracts structured audio-creation instructions, synthesizes component sounds with text-to-audio models, and verifies outputs via logical checks and a multimodal alignment step before merging into final datasets with rich metadata. Key contributions include the first general-purpose audio anomaly data generation framework, a plug-and-play architecture adaptable to evolving LLMs and audio models, and comprehensive evaluations showing improvements over existing text-to-audio methods and insights into ALM and separation-model capabilities. The framework enables scalable, diverse, realistic audio benchmarks that can enhance training and evaluation for audio anomaly detection in real-world, audio-only contexts such as surveillance and telecommunication.

Abstract

We introduce a novel, general-purpose audio generation framework specifically designed for anomaly detection and localization. Unlike existing datasets, which predominantly focus on industrial and machine-related sounds, our framework covers a broader range of environments and is particularly useful in real-world scenarios where only audio data are available, such as video-derived or telephonic audio. To generate such data, we propose a new method inspired by the LLM-Modulo framework, which leverages large language models (LLMs) as world models to simulate such real-world scenarios. The tool is modular, allowing a plug-and-play approach. It operates by first using LLMs to predict plausible real-world scenarios. An LLM then extracts the constituent sounds, their order, and the way in which they should be merged to create a coherent whole. Much like the LLM-Modulo framework, we include rigorous verification of each output stage, ensuring the reliability of the generated data. The data produced using the framework serves as a benchmark for anomaly detection applications, potentially enhancing the performance of models trained on audio data, particularly in handling out-of-distribution cases. Our contributions thus fill a critical void in audio anomaly detection resources and provide a scalable tool for generating diverse, realistic audio data.
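
The stage order described above (scenario generation, instruction extraction, per-sound synthesis, verification, merging) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (SoundSpec, Pipeline, build_demo_pipeline, the stub callables) is hypothetical, the LLM and text-to-audio calls are replaced by fixed stubs, and the checks are simple placeholders for the logical and multimodal verification the framework performs. It only shows how a plug-and-play design might let each stage be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SoundSpec:
    """One component sound, as an LLM might describe it (hypothetical schema)."""
    description: str         # text prompt for a text-to-audio model
    start_s: float           # offset of the sound in the final mix, in seconds
    anomalous: bool = False  # whether this component is the anomaly


@dataclass
class Pipeline:
    """Each stage is a plain callable, so any stage can be replaced on its own."""
    generate_scene: Callable[[str], str]
    extract_specs: Callable[[str], List[SoundSpec]]
    synthesize: Callable[[SoundSpec], List[float]]
    verify_clip: Callable[[SoundSpec, List[float]], bool]
    sample_rate: int = 16000
    max_retries: int = 3

    def run(self, setting: str) -> Dict:
        scene = self.generate_scene(setting)
        specs = self.extract_specs(scene)
        # Placeholder logical check: non-empty, valid offsets, one anomaly at least.
        if (not specs or any(s.start_s < 0 for s in specs)
                or not any(s.anomalous for s in specs)):
            raise ValueError("extracted instructions failed the logical check")
        mix: List[float] = []
        for spec in specs:
            samples = self._synthesize_verified(spec)
            offset = int(spec.start_s * self.sample_rate)
            # Grow the mix with silence if this clip ends past its current end.
            mix.extend([0.0] * max(0, offset + len(samples) - len(mix)))
            for i, x in enumerate(samples):
                mix[offset + i] += x
        metadata = [
            {"description": s.description, "start_s": s.start_s,
             "anomalous": s.anomalous}
            for s in specs
        ]
        return {"scene": scene, "audio": mix, "metadata": metadata}

    def _synthesize_verified(self, spec: SoundSpec) -> List[float]:
        # Regenerate until the verifier accepts the clip or retries run out.
        for _ in range(self.max_retries):
            samples = self.synthesize(spec)
            if self.verify_clip(spec, samples):
                return samples
        raise RuntimeError(f"no verified audio for: {spec.description}")


def build_demo_pipeline() -> Pipeline:
    """Wire the pipeline with fixed stubs in place of real models."""
    return Pipeline(
        generate_scene=lambda setting: f"A quiet {setting}; a glass suddenly shatters.",
        extract_specs=lambda scene: [
            SoundSpec("quiet room ambience", 0.0),
            SoundSpec("glass shattering", 1.0, anomalous=True),
        ],
        synthesize=lambda spec: [0.5] * 4,               # stand-in for text-to-audio
        verify_clip=lambda spec, samples: len(samples) > 0,  # stand-in for alignment check
        sample_rate=2,                                    # tiny rate keeps the demo readable
    )


result = build_demo_pipeline().run("library")
print(result["scene"])
print(result["audio"])
```

In this sketch the metadata records which component is anomalous and when it starts, which is the kind of information a localization benchmark would need; a real verifier would replace the length check with a model-based comparison between each clip and its text description.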

Paper Structure

This paper contains 19 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Audio Anomaly Data Generation (AADG), a framework that synthetically generates real-life audio data with anomalies by leveraging LLMs as a world model.
  • Figure 2: Illustration of the pipeline for generating and verifying anomalous audio data. The process begins with scene generation, followed by information extraction using a large language model (LLM). Individual audio components are synthesized from text descriptions, verified for accuracy, and then merged according to the LLM's instructions, culminating in a dataset of realistic anomalous audio.