Which Augmentation Should I Use? An Empirical Investigation of Augmentations for Self-Supervised Phonocardiogram Representation Learning

Aristotelis Ballas, Vasileios Papapanagiotou, Christos Diou

TL;DR

The paper tackles robust phonocardiogram (PCG) classification under distribution shifts by systematically evaluating a wide range of augmentation policies within a self-supervised contrastive learning framework inspired by SimCLR. It adapts NT-Xent-based SSL to 1D PCG signals, pretrains on unlabeled data from multiple datasets, and fine-tunes downstream classifiers on labeled data, then tests generalization on unseen datasets. Key findings show SSL representations yield substantially better out-of-distribution generalization than fully supervised models (average gains of approximately 11.68%, up to 21.04% on PhysioNet 2022 OOD), with low-pass augmentations and signal flips/scale consistently boosting performance; Pascal, despite its smaller size, benefits notably from SSL when labeled data are scarce. The study provides practical augmentation guidelines and a rigorous evaluation protocol, offering a pathway to more reliable PCG models in real-world healthcare settings, and suggests extending the approach to other 1D biosignals.

Abstract

Despite recent advancements in deep learning, its application in real-world medical settings, such as phonocardiogram (PCG) classification, remains limited. A significant barrier is the lack of high-quality annotated datasets, which hampers the development of robust, generalizable models that can perform well on newly collected, out-of-distribution (OOD) data. Self-Supervised Learning (SSL), and contrastive learning in particular, has shown promise in mitigating the issue of data scarcity by using unlabeled data to enhance model robustness. Even though SSL methods have been proposed and researched in other domains, works focusing on the impact of data augmentations on model robustness for PCG classification are limited. In particular, while augmentations are a key component in SSL, selecting the most suitable policy during training is highly challenging. Improper augmentations can lead to substantial performance degradation and even hinder a network's ability to learn meaningful representations. Addressing this gap, our research aims to explore and evaluate a wide range of audio-based augmentations and uncover combinations that enhance SSL model performance in PCG classification. We conduct a comprehensive comparative analysis across multiple datasets, assessing the impact of various augmentations on model performance. Our findings reveal that, depending on the training distribution, augmentation choice significantly influences model robustness, with fully-supervised models experiencing up to a 32% drop in effectiveness when evaluated on unseen data, while SSL models demonstrate greater resilience, losing only 10% or even improving in some cases. This study also highlights the most promising and appropriate augmentations for PCG signal processing by calculating their effect size on training. These insights equip researchers with valuable guidelines for developing reliable models in PCG signal processing.
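
The contrastive objective named in the TL;DR, NT-Xent, can be illustrated with a minimal sketch. The code below follows the standard SimCLR formulation (cosine similarity, temperature scaling, every other view in the batch acting as a negative); it is not taken from the paper's code. The function names, the plain-list representation of embeddings, and the temperature value of 0.5 are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss for N positive pairs (views_a[i], views_b[i]).

    temperature=0.5 is an illustrative default, not a value from the paper.
    """
    z = list(views_a) + list(views_b)  # 2N embeddings in total
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same signal
        # Similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit  # equals -log(softmax of the positive)
    return total / (2 * n)

# Embeddings whose two views agree should give a lower loss than mismatched ones.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

In this toy setting the loss is lower when the two views of each signal map to nearby embeddings, which is the agreement the pretraining stage tries to maximize.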

Paper Structure

This paper contains 16 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of the proposed experiment pipeline for training and evaluating the effectiveness and robustness of a model trained via Self-Supervised Contrastive Learning for PCG classification. The framework has six steps. In the first step, all datasets are prepared and homogenized into a common format, as described in Section \ref{sec:ds_task}. In the second step, each signal is augmented into two versions. In the third step, the unlabeled, augmented signals are used to train the backbone encoder (Fig. 2, left) to maximize the agreement between representations originating from the same initial signal. Following the pretraining phase, in step 4, a classification head (Fig. 2, right) is attached to the frozen weights of the pretrained encoder. The classifier weights are fine-tuned on data drawn from one of the Pascal, PhysioNet2016 or PhysioNet2022 datasets, in a fully supervised manner. The final model is then evaluated on the test split of the same dataset. In the fifth step of the framework, the generalization ability of the fine-tuned model is assessed on signals drawn from datasets that are left out ("unseen") during the training process in step 4. We argue that, given sufficient data and appropriate augmentations, the pretrained encoder will be able to extract a generalized PCG representation regardless of the dataset. Finally, we repeat the fine-tuning and OOD evaluation steps for each remaining dataset (step 6), to assess the robustness of our method when the classifier has been trained on different signal distributions.
  • Figure 2: The architecture of the CNN encoder trained via SSL contrastive learning (left) and the classification head trained for the downstream task (right). The weights of the encoder or "backbone network" are frozen after the pretraining phase. During the downstream task, the classification head is attached to the pretrained encoder and its weights are fine-tuned on the dataset of said task in a fully-supervised manner. The architecture of the "downstream model" is used as a baseline in all of our experiments, where all of its layer weights are adjusted during fully-supervised training.
  • Figure 3: Number of occurrences of each augmentation type in the top 75 performing models in each downstream task, leading to 150 total augmentation pairs. This offers an indication of the importance of each augmentation across all experiments. The plot on the left presents the results for the In-Distribution experiments, while the plot on the right presents the results for the OOD experiments.
  • Figure 4: t-SNE visualizations of feature vectors extracted from the penultimate layer after training, for a baseline fully-supervised model and for a model trained via contrastive SSL. To exhibit the generalization capabilities of a model trained via the proposed framework, we plot the features of both In-Distribution samples from the PhysioNet2016 dataset and OOD samples from the PhysioNet2022 dataset. Abnormal signals are colored in blue, while normal signals are shown in orange. As illustrated, the features extracted by the baseline model are chaotic and intermixed, whereas the SSL model seems to group samples in a more structured manner.
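
Step 2 of the pipeline in Figure 1, where each signal is augmented into two versions, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the interpretation of "flip" as a polarity inversion, the moving-average filter standing in for a low-pass augmentation, the scaling range of 0.5 to 1.5, and all function names are assumptions introduced here.

```python
import random

def flip_sign(signal):
    """Invert waveform polarity (one possible reading of a signal 'flip')."""
    return [-x for x in signal]

def scale(signal, factor):
    """Multiply every sample by a constant amplitude factor."""
    return [factor * x for x in signal]

def moving_average_lowpass(signal, window=5):
    """Crude low-pass stand-in: centered moving average, truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        segment = signal[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed

def random_view(signal, rng):
    """Apply one randomly chosen augmentation to a copy of the signal."""
    choice = rng.choice(["lowpass", "flip", "scale"])
    if choice == "lowpass":
        return moving_average_lowpass(signal)
    if choice == "flip":
        return flip_sign(signal)
    # Hypothetical scaling range, chosen only for illustration.
    return scale(signal, rng.uniform(0.5, 1.5))

def two_views(signal, rng):
    """Produce the two independently augmented views used as a positive pair."""
    return random_view(signal, rng), random_view(signal, rng)

# A rapidly alternating toy signal: the low-pass view attenuates it strongly.
toy_signal = [1.0, -1.0] * 8
view_a, view_b = two_views(toy_signal, random.Random(0))
print(len(view_a), len(view_b))
```

Both views keep the original length, so they can be fed to the same encoder; in practice a proper filter design would replace the moving average.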