H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata

Chenyang Xu, Siming Li, Hao Wang

TL;DR

The paper tackles data scarcity in PCG-based cardiovascular diagnosis by introducing H-LDM, a hierarchical latent diffusion framework that generates high-fidelity, clinically controllable phonocardiograms from structured clinical metadata. It combines a Multi-Scale Variational Autoencoder, which learns a physiologically-disentangled latent space, with a conditional diffusion model guided by rich metadata and a Medical Attention module that enforces physiological periodicity. Key contributions include physiologically-disentangled latent subspaces (rhythm, S1/S2, murmur, noise), a hierarchical metadata encoding that fuses BERT-based narratives with a patient knowledge graph via GraphSAGE, and a diffusion process with structured noise prediction for explicit attribute control. On CirCor, the model reaches a Fréchet Audio Distance of $FAD=9.7$, an attribute disentanglement score of $AD=0.92$, and a clinical validity of $CV=0.87$; augmenting training data with its samples improves rare-disease classification by 11.3%. The work advances clinical data augmentation by enabling interpretable, counterfactual exploration of cardiac pathologies, with potential to improve education and diagnostic robustness in real-world settings.

Abstract

Phonocardiogram (PCG) analysis is vital for cardiovascular disease diagnosis, yet the scarcity of labeled pathological data limits the performance of AI systems. To bridge this gap, we introduce H-LDM, a Hierarchical Latent Diffusion Model for generating clinically accurate and controllable PCG signals from structured metadata. Our approach features: (1) a multi-scale VAE that learns a physiologically-disentangled latent space, separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal pipeline that leverages rich clinical metadata for fine-grained control over 17 distinct conditions; and (3) an interpretable diffusion process guided by a novel Medical Attention module. Experiments on the PhysioNet CirCor dataset demonstrate state-of-the-art performance, achieving a Fréchet Audio Distance of 9.7, a 92% attribute disentanglement score, and 87.1% clinical validity confirmed by cardiologists. Augmenting diagnostic models with our synthetic data improves the accuracy of rare disease classification by 11.3%. H-LDM establishes a new direction for data augmentation in cardiac diagnostics, bridging data scarcity with interpretable clinical insights.

Paper Structure

This paper contains 18 sections, 5 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Architecture Overview.
  • Figure 2: U-Net Architecture for Latent Diffusion. Our U-Net operates in the compressed latent space. Clinical condition embeddings ($\mathbf{c}$) and timestep embeddings ($\mathbf{t}$) are fused and injected into each ResBlock to guide the denoising process. Medical Attention modules enforce physiological periodicity, while skip connections preserve high-frequency details by linking encoder and decoder features.
  • Figure 3: Correlation Analysis of Evaluation Metrics. (a) Core fidelity metrics (e.g., SNR, SSIM) are highly correlated, indicating they capture similar aspects of signal quality. (b) Our proposed Physiological Disentanglement Score (PDS) shows low correlation with other metrics, confirming it measures a unique, non-redundant aspect of model performance: interpretability. This validates its inclusion as a key evaluation criterion.
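The reasoning behind the Figure 3 caption can be illustrated with a small correlation check. The following is a minimal sketch, not the authors' analysis code: the helper names `pearson` and `correlation_matrix` are hypothetical, and the scores are made-up toy values for five imaginary samples rather than results from the paper. It only shows how two fidelity metrics can track each other closely while a third metric stays nearly uncorrelated with them.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(metrics):
    """Pairwise Pearson correlations for a dict of metric name -> scores."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}


# Toy, made-up per-sample scores (assumption: NOT taken from the paper).
toy_scores = {
    "snr": [10.0, 12.0, 14.0, 16.0, 18.0],
    "ssim": [0.60, 0.66, 0.70, 0.77, 0.80],
    "pds": [0.90, 0.70, 0.95, 0.72, 0.88],
}

matrix = correlation_matrix(toy_scores)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

In this toy data the SNR and SSIM columns are strongly correlated while the PDS column is close to uncorrelated with both, which is the pattern the caption describes: a metric with low correlation to the others contributes information that the others do not.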