H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata

Chenyang Xu; Siming Li; Hao Wang

TL;DR

The paper tackles data scarcity in PCG-based cardiovascular diagnosis by introducing H-LDM, a hierarchical latent diffusion framework that generates high-fidelity, clinically controllable phonocardiograms (PCGs) from structured clinical metadata. It combines a multi-scale variational autoencoder, which learns a physiologically disentangled latent space, with a conditional diffusion model guided by rich metadata and a Medical Attention module that enforces physiological periodicity. Key contributions are: physiologically disentangled latent subspaces (rhythm, S1/S2, murmur, noise); a hierarchical metadata encoding that fuses BERT-based narratives with a patient knowledge graph via GraphSAGE; and a diffusion process with structured noise prediction for explicit attribute control. On CirCor, H-LDM reaches a Fréchet Audio Distance of $FAD=9.7$, an attribute disentanglement score of $AD=0.92$, and a clinical validity of $CV=0.87$, and augmenting training data with its synthetic signals improves rare-disease classification accuracy by 11.3%. The work advances clinical data augmentation by enabling interpretable, counterfactual exploration of cardiac pathologies, with potential to improve education and diagnostic robustness in real-world settings.
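To make the idea of disentangled latent subspaces and counterfactual exploration concrete, here is a minimal sketch, not the authors' implementation. The `PCGLatent` class, the `swap_murmur` helper, the subspace sizes, and all numeric values are hypothetical; the source names only the four subspaces (rhythm, S1/S2, murmur, noise).

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PCGLatent:
    """Hypothetical container for a latent code split into four named subspaces."""
    rhythm: Tuple[float, ...]
    heart_sounds: Tuple[float, ...]  # S1/S2 subspace
    murmur: Tuple[float, ...]
    noise: Tuple[float, ...]

    def flatten(self) -> Tuple[float, ...]:
        # Concatenate the subspaces into a single latent vector.
        return self.rhythm + self.heart_sounds + self.murmur + self.noise


def swap_murmur(base: PCGLatent, donor: PCGLatent) -> PCGLatent:
    """Counterfactual edit: keep every subspace of `base` except murmur."""
    return replace(base, murmur=donor.murmur)


# Illustrative values only; sizes and numbers are assumptions.
reference = PCGLatent(rhythm=(0.1, 0.2), heart_sounds=(0.5, 0.4),
                      murmur=(0.0, 0.0), noise=(0.01,))
with_murmur = PCGLatent(rhythm=(0.3, 0.1), heart_sounds=(0.6, 0.2),
                        murmur=(0.9, 0.7), noise=(0.05,))

counterfactual = swap_murmur(reference, with_murmur)
```

If the subspaces were truly disentangled, decoding `counterfactual` would be expected to change only the murmur content of the signal while leaving rhythm and heart sounds as in `reference`.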

Abstract

Phonocardiogram (PCG) analysis is vital for cardiovascular disease diagnosis, yet the scarcity of labeled pathological data limits the performance of AI systems. To address this gap, we introduce H-LDM, a Hierarchical Latent Diffusion Model for generating clinically accurate and controllable PCG signals from structured metadata. Our approach features: (1) a multi-scale VAE that learns a physiologically disentangled latent space, separating rhythm, heart sounds, and murmurs; (2) a hierarchical text-to-biosignal pipeline that leverages rich clinical metadata for fine-grained control over 17 distinct conditions; and (3) an interpretable diffusion process guided by a novel Medical Attention module. Experiments on the PhysioNet CirCor dataset demonstrate state-of-the-art performance, achieving a Fréchet Audio Distance of 9.7, a 92% attribute disentanglement score, and 87.1% clinical validity confirmed by cardiologists. Augmenting diagnostic models with our synthetic data improves the accuracy of rare disease classification by 11.3%. H-LDM establishes a new direction for data augmentation in cardiac diagnostics, bridging data scarcity with interpretable clinical insights.
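The Fréchet Audio Distance reported above compares Gaussian fits of embeddings from real and synthetic audio. The sketch below is a simplified illustration that assumes diagonal covariances, so the distance reduces to a per-dimension sum; the standard metric uses full covariance matrices and embeddings from a pretrained audio model. The function name and the toy embeddings are hypothetical, and this is not the evaluation code used in the paper.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence


def diagonal_frechet_distance(real: Sequence[Sequence[float]],
                              synth: Sequence[Sequence[float]]) -> float:
    """Fréchet distance between two Gaussians fitted per dimension.

    Assumes diagonal covariances (a simplification), giving
    sum_d [(mu_r - mu_s)^2 + var_r + var_s - 2 * sqrt(var_r * var_s)].
    """
    dims = len(real[0])
    total = 0.0
    for d in range(dims):
        r = [row[d] for row in real]
        s = [row[d] for row in synth]
        mu_r, mu_s = fmean(r), fmean(s)
        var_r, var_s = pvariance(r), pvariance(s)
        total += (mu_r - mu_s) ** 2 + var_r + var_s - 2.0 * math.sqrt(var_r * var_s)
    return total


# Toy 2-D "embeddings"; values are made up for illustration.
real_emb = [[0.0, 1.0], [2.0, 3.0]]
shifted_emb = [[1.0, 2.0], [3.0, 4.0]]  # same spread, every mean shifted by 1

same = diagonal_frechet_distance(real_emb, real_emb)
shifted = diagonal_frechet_distance(real_emb, shifted_emb)
```

Lower values mean the synthetic embedding distribution is closer to the real one; identical distributions give a distance of zero.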
