Improved Techniques for the Conditional Generative Augmentation of Clinical Audio Data

Mane Margaryan, Matthias Seibold, Indu Joshi, Mazda Farshad, Philipp Fürnstahl, Nassir Navab

TL;DR

This work tackles the data bottleneck in clinical audio analysis by proposing a conditional GAN augmentation framework that generates mel spectrograms from a learned data distribution. The generator uses residual Squeeze-and-Excitation blocks (SE-ResNet) to reduce latent feature redundancy, improving sample quality as evidenced by a lower Fréchet Inception Distance (FID) and better classifier performance. On a Total Hip Arthroplasty sound dataset, a classifier trained on the augmented data achieves a Macro F1-score of $96.74 \pm 1.03\%$, a $2.84\%$ improvement from augmentation and $1.14\%$ above a prior generative method. FID improves by about $0.3$ points, while the generator grows by only about $0.74\%$ in parameters. These results highlight the practical potential of SE-enabled conditional augmentation to mitigate data scarcity in clinical acoustic sensing and to improve downstream classification tasks.
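For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Macro F1-score is computed: the F1-score is evaluated per class and then averaged without class weighting. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper's evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted wrongly
            fn[t] += 1          # class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three of the class names used in the figures.
y_true = ["Sawing", "Sawing", "Reaming", "Reaming", "Suction", "Suction"]
y_pred = ["Sawing", "Reaming", "Reaming", "Reaming", "Suction", "Suction"]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class contributes equally to the average, a classifier cannot reach a high Macro F1-score by performing well only on frequent classes.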

Abstract

Data augmentation is a valuable tool for the design of deep learning systems to overcome data limitations and stabilize the training process. This holds especially in the medical domain, where the collection of large-scale data sets is challenging and expensive due to limited access to patient data and relevant environments, as well as strict regulations. There, community-curated large-scale public datasets, pretrained models, and advanced data augmentation methods are the main factors for developing reliable systems to improve patient care. However, for the development of medical acoustic sensing systems, an emerging field of research, the community lacks large-scale publicly available data sets and pretrained models. To address the problem of limited data, we propose an augmentation method based on a conditional generative adversarial network that synthesizes mel spectrograms from a learned data distribution of a source data set. In contrast to previously proposed fully convolutional models, the proposed model implements residual Squeeze-and-Excitation modules in the generator architecture. We show that our method outperforms all classical audio augmentation techniques and previously published generative methods in terms of generated sample quality, and that it yields a Macro F1-Score improvement of $2.84\%$ for a classifier trained on the augmented data set, an enhancement of $1.14\%$ in relation to previous work. By analyzing the correlation of intermediate feature spaces, we show that the residual Squeeze-and-Excitation modules help the model to reduce redundancy in the latent features. Therefore, the proposed model advances the state of the art in the augmentation of clinical audio data and alleviates the data bottleneck for the design of clinical acoustic sensing systems.
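To make the idea of a residual Squeeze-and-Excitation module concrete, here is a minimal NumPy sketch of the forward pass of one such block. It is not the authors' generator: the 1x1 channel-mixing step stands in for the block's real convolutions, and the reduction ratio `r = 16`, the weight shapes, and the function name `se_residual_block` are all assumptions chosen for illustration.

```python
import numpy as np


def se_residual_block(x, w_conv, w_reduce, w_expand):
    """Forward pass of a simplified residual SE block.

    x        : (C, H, W) input feature map
    w_conv   : (C, C) weights of a 1x1 convolution (stand-in for the
               block's convolutional layers; an assumption of this sketch)
    w_reduce : (C // r, C) weights of the bottleneck layer
    w_expand : (C, C // r) weights of the expansion layer
    """
    # Convolutional branch: 1x1 convolution followed by ReLU.
    u = np.maximum(np.einsum("oc,chw->ohw", w_conv, x), 0.0)
    # Squeeze: global average pooling yields one descriptor per channel.
    z = u.mean(axis=(1, 2))
    # Excitation: bottleneck MLP with a sigmoid gives a gate in (0, 1) per channel.
    s = np.maximum(w_reduce @ z, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w_expand @ s)))
    # Scale each channel by its gate, then add the residual (skip) connection.
    return x + u * gate[:, None, None]


rng = np.random.default_rng(0)
C, H, W, r = 64, 32, 32, 16          # r is a hypothetical reduction ratio
x = rng.standard_normal((C, H, W))
y = se_residual_block(
    x,
    0.1 * rng.standard_normal((C, C)),
    0.1 * rng.standard_normal((C // r, C)),
    0.1 * rng.standard_normal((C, C // r)),
)
print(y.shape)
```

The two small gating matrices add only `2 * C * C / r` weights per block (ignoring biases), which is why channel recalibration of this kind is cheap relative to the convolutions it modulates.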

Paper Structure (8 sections, 2 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: The schematic illustrates the structure of the proposed SE-ResNet generator for the generation of synthetic mel spectrograms.
  • Figure 2: Log-mel spectrograms of random samples for each class (top row); log-mel spectrograms of randomly generated samples from our proposed model (second row); log-mel spectrograms generated by the model proposed in previous work [seibold2022conditional] (bottom row). Respective classes from left to right: Sawing, Adjustment, Reaming, Coagulation, Insertion, Suction.
  • Figure 3: Sample correlation matrices of features learned by the proposed model (left column) and by the cWGAN-GP published in previous work [seibold2022conditional] (right column). The correlation matrices are computed from different intermediate layers of the generator network. The plots represent the correlation in the feature space after the second-last convolutional layer with dimensions 32x32x64. The significantly lower correlation values obtained after introducing the Squeeze-and-Excitation block demonstrate the reduced correlation among features and therefore reduced feature redundancy.
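The kind of analysis described for Figure 3 can be sketched as follows. This is a hedged illustration only: it assumes that each of the 64 channels of a 32x32x64 feature map is treated as a variable and each spatial position as an observation, which may differ from how the authors computed their matrices. The helper names `channel_correlation` and `mean_abs_offdiag` and the synthetic feature maps are hypothetical.

```python
import numpy as np


def channel_correlation(features):
    """Pearson correlation between channels of an (H, W, C) feature map."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)        # rows: positions, columns: channels
    return np.corrcoef(flat, rowvar=False)   # (C, C) correlation matrix


def mean_abs_offdiag(corr):
    """Average absolute correlation between distinct channels."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[mask]).mean())


rng = np.random.default_rng(0)
# Synthetic stand-ins for generator activations (not real model outputs).
independent = rng.standard_normal((32, 32, 64))
shared = rng.standard_normal((32, 32, 1))
redundant = shared + 0.1 * rng.standard_normal((32, 32, 64))

print(mean_abs_offdiag(channel_correlation(independent)))
print(mean_abs_offdiag(channel_correlation(redundant)))
```

A lower mean absolute off-diagonal value indicates that channels carry less overlapping information, which is the sense in which reduced correlation is read as reduced feature redundancy.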