Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation

Myeonghoon Ryu, Hongseok Oh, Suji Lee, Han Park

TL;DR

The paper tackles sound event classification (SEC) under device variability caused by differences in recording hardware. It proposes Unified Microphone Conversion, a CycleGAN framework conditioned through Feature-wise Linear Modulation (FiLM) that achieves many-to-many device mappings with a single generator and one discriminator per device, augmented by a synthetic frequency response (FR) difference generator. Key contributions include the FiLM encoder, which modulates feature statistics with device-specific embeddings, the integration of frequency-response information into the generator, and a scalable synthetic FR difference strategy that reduces data collection needs. Empirical results show a $2.6\%$ improvement in macro-average F1 and a $0.8\%$ reduction in variability relative to the state of the art, indicating scalable, robust SEC performance across diverse devices.
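
As a rough illustration of the FiLM conditioning described above, the following NumPy sketch maps a frequency response difference vector to a per-channel scale and shift and applies them to a feature map. It is not the authors' implementation: the single linear layer standing in for the FiLM encoder, the names `film_params`, `film`, `W_gamma`, and `W_beta`, and all shapes are assumptions made for the example.

```python
import numpy as np

def film_params(fr_diff, W_gamma, b_gamma, W_beta, b_beta):
    """Map a frequency response difference vector to per-channel (gamma, beta).

    A single linear layer stands in for the FiLM encoder here; this is an
    assumption for illustration, and the paper's encoder may differ.
    """
    gamma = W_gamma @ fr_diff + b_gamma  # shape: (channels,)
    beta = W_beta @ fr_diff + b_beta     # shape: (channels,)
    return gamma, beta

def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * x + beta, applied per channel.

    features has shape (channels, freq, time).
    """
    return gamma[:, None, None] * features + beta[:, None, None]

# Hypothetical sizes and random weights, chosen only for the example.
rng = np.random.default_rng(0)
n_bins, channels = 8, 4
W_gamma = rng.normal(size=(channels, n_bins)) * 0.1
W_beta = rng.normal(size=(channels, n_bins)) * 0.1
b_gamma = np.ones(channels)   # bias of 1 so a zero input yields identity scaling
b_beta = np.zeros(channels)

fr_diff = rng.normal(size=n_bins)               # stand-in conditioning vector
features = rng.normal(size=(channels, 16, 10))  # stand-in generator activations

gamma, beta = film_params(fr_diff, W_gamma, b_gamma, W_beta, b_beta)
out = film(features, gamma, beta)
```

With the biases chosen here, a zero frequency response difference (same source and target device) gives `gamma = 1` and `beta = 0`, so the modulation leaves the features unchanged. That property comes from this sketch's initialization, not from anything stated in the paper.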

Abstract

We present Unified Microphone Conversion, a unified generative framework designed to bolster sound event classification (SEC) systems against device variability. While our prior CycleGAN-based methods effectively simulate device characteristics, they require separate models for each device pair, limiting scalability. Our approach overcomes this constraint by conditioning the generator on frequency response data, enabling many-to-many device mappings through unpaired training. We integrate frequency-response information via Feature-wise Linear Modulation, further enhancing scalability. Additionally, incorporating synthetic frequency response differences improves the framework's applicability in real-world settings. Experimental results show that our method outperforms the state of the art by 2.6% and reduces variability by 0.8% in macro-average F1 score.
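
For readers unfamiliar with the reported metric, the sketch below computes a macro-average F1 score in plain Python: F1 is computed per class and then averaged with equal weight, so rare classes count as much as common ones. The labels and predictions are made up for the example, and the function name `macro_f1` is hypothetical; the paper's actual evaluation protocol is not described in this summary.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up sound event labels, for illustration only.
y_true = ["dog", "dog", "siren", "siren", "siren", "glass"]
y_pred = ["dog", "siren", "siren", "siren", "glass", "glass"]
score = macro_f1(y_true, y_pred)
```

In this toy case each of the three classes has an F1 of 2/3, so the macro average is also 2/3.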

Paper Structure

This paper contains 11 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The training pipeline of the Unified Microphone Conversion model. The model receives an input spectrogram and a frequency response difference—computed by subtracting the frequency response of the source device from that of the target device—and outputs a spectrogram that simulates the target device domain. The FiLM encoder is jointly trained with $G$ to learn device-specific modulation parameters, while each domain $i$ is assigned its own discriminator $D_i$ to enforce alignment with the target device’s distribution.
  • Figure 2: The training phase for SEC models using the trained Unified Microphone Conversion model $G$. Either real-world or synthetic frequency response difference data are provided to the FiLM encoder, which then modulates the generator $G$, producing converted spectrograms for robust SEC model training.
  • Figure 3: (left) Mutual information estimates between the target device and the dimension-wise statistics—the embeddings averaged across each dimension—at different layers of the Microphone Conversion network. (right) Classification accuracy of the target device given these dimension-wise statistics.
  • Figure 4: The first two rows display the input spectrogram of different acoustic contents and recording devices, and the frequency response difference between the target and input devices. The third row presents samples generated by Unified Microphone Conversion using these inputs, while the final row depicts ground truth spectrograms from the target devices.
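
The Figure 1 caption defines the conditioning input as the target device's frequency response minus the source device's. A minimal sketch of that subtraction, and of how one conditioning vector per ordered device pair lets a single generator cover every mapping, might look as follows. The device names, the four-bin responses, and the use of dB values are assumptions made for the example; the summary does not specify the representation.

```python
from itertools import permutations

import numpy as np

def fr_difference(source_fr, target_fr):
    """Frequency response difference: target minus source, per frequency bin."""
    return np.asarray(target_fr, dtype=float) - np.asarray(source_fr, dtype=float)

# Hypothetical per-bin frequency responses (assumed to be in dB).
device_fr = {
    "device_a": np.array([0.0, -1.0, -3.0, -6.0]),
    "device_b": np.array([-2.0, 0.0, 1.0, -1.0]),
    "device_c": np.array([1.0, 1.0, 0.0, -4.0]),
}

# One conditioning vector per ordered (source, target) pair.
conditions = {
    (src, tgt): fr_difference(device_fr[src], device_fr[tgt])
    for src, tgt in permutations(device_fr, 2)
}
```

Three devices give six ordered pairs, each served by the same conditioned generator rather than by a separate model per pair. Swapping source and target simply negates the difference vector.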