Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu
TL;DR
Resp-Agent addresses two problems in respiratory-sound analysis, information loss and data scarcity, by combining diagnostic reasoning with controllable audio synthesis in a closed-loop, agent-based framework. It introduces Resp-229k, a large cross-domain benchmark with LLM-generated clinical narratives that supports multimodal learning and evaluation under strict domain shift. The Thinker-A$^2$CA planner coordinates a Generator (discrete-unit planning and Flow Matching) and a Diagnoser (modality weaving with Strategic Global Attention and audio anchors), improving minority-class accuracy and macro-F1 and enabling targeted synthesis to rebalance data. Empirically, Resp-Agent improves diagnostic robustness on cross-domain data, and content-aware synthesis substantially outperforms naive augmentation, which points to practical potential for data-efficient, edge-aware medical-audio AI and clinician-in-the-loop systems. The work provides a foundation for trustworthy multimodal respiratory AI, with code and data released for reproducibility under privacy-preserving constraints.
Abstract
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
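The closed loop described above, in which the controller identifies diagnostic weaknesses and schedules targeted synthesis, can be illustrated with a minimal sketch. This is not the Thinker-A$^2$CA implementation: the function names `per_class_recall` and `schedule_synthesis`, the class labels, and the proportional-to-error allocation rule are all hypothetical and chosen only to make the idea concrete.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that appears in y_true."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / total[label] for label in total}


def schedule_synthesis(y_true, y_pred, budget):
    """Split a synthesis budget across classes in proportion to error rate.

    Assumption (not from the paper): a class's "weakness" is 1 - recall,
    and the number of samples to synthesize scales linearly with it.
    """
    recall = per_class_recall(y_true, y_pred)
    error = {label: 1.0 - r for label, r in recall.items()}
    total_error = sum(error.values())
    if total_error == 0:
        # Nothing is misclassified, so no targeted synthesis is requested.
        return {label: 0 for label in error}
    return {label: round(budget * e / total_error) for label, e in error.items()}


# Hypothetical labels and predictions from one evaluation round.
y_true = ["normal"] * 4 + ["copd"] * 2 + ["pneumonia"] * 2
y_pred = ["normal"] * 4 + ["copd", "normal"] + ["normal", "normal"]

plan = schedule_synthesis(y_true, y_pred, budget=300)
print(plan)
```

In this toy round the well-classified majority class receives no synthetic samples, while the classes with the lowest recall receive most of the budget. A full loop would retrain the diagnoser on the augmented data and repeat the evaluation. Because each share is rounded independently, the allocations are not guaranteed to sum exactly to the budget.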
