Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

TL;DR

Resp-Agent tackles the dual challenges of information loss and data scarcity in respiratory-sound analysis by uniting diagnostic reasoning with controllable audio synthesis in a closed-loop, agent-based framework. It introduces Resp-229k, a large cross-domain benchmark with LLM-generated clinical narratives to enable robust multimodal learning and evaluation under strict domain shifts. The Thinker-A$^2$CA planner coordinates a Generator (discrete-unit planning and Flow Matching) and a Diagnoser (modality weaving with Strategic Global Attention and audio anchors), achieving strong gains in minority-class accuracy and macro-F1, while enabling targeted synthesis to rebalance data. Empirically, Resp-Agent improves diagnostic robustness on cross-domain data and demonstrates that content-aware synthesis substantially outperforms naive augmentation, highlighting the practical potential for data-efficient, edge-aware medical-audio AI and clinician-in-the-loop systems. The work provides a foundation for trustworthy multimodal respiratory AI with accessible code and data for reproducibility under privacy-preserving constraints.

Abstract

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.

Paper Structure

This paper contains 39 sections, 8 equations, 3 figures, and 16 tables.

Figures (3)

  • Figure 1: Overview of Resp-Agent. The framework functions as a closed-loop system composed of three interacting modules: (a) Thinker: A compute-aware planner (Thinker-A$^2$CA) that parses semantic intents and routes tasks to other agents based on recycled error profiles and calibrated confidence. (b) Generator: A synthesis module utilizing modality injection to condition the Resp-MLLM on both textual diagnosis and reference acoustic style, decoding discrete units via conditional flow matching. (c) Diagnoser: A clinical inference module employing modality weaving to fuse EHR summaries with audio features early in the network, leveraging sparse global attention for robust cross-modal reasoning.
  • Figure 2: Detailed architecture of Resp-MLLM (Stage 1 of the Generator). The model functions as a style-conditioned multimodal unit generator. Top: A modality injection mechanism fuses textual diagnosis semantics with acoustic style embeddings (projected from temporally pooled BEATs features) to prompt the Qwen3-0.6B-Base backbone. Bottom: A leak-free conditioning strategy is employed during training: random mask sampling ($\mathcal{M} \approx 10\%$) prevents the model from peeking at oracle tokens, ensuring robust autoregressive prediction of discrete acoustic units.
  • Figure 3: Diagnoser Architecture: Modality Weaving with Strategic Global Attention. The framework comprises three key mechanisms: (1) Input-Level Modality Weaving: EHR text tokens and projected audio features (extracted via BEATs) are fused into a single token-aligned sequence at the input layer, enabling native cross-modal interaction. (2) Strategic Global Attention: A Longformer backbone combines efficient sliding-window attention with a sparse set of global tokens (textual sentinels and distributed Audio Anchors) to model long-range dependencies with linear complexity. (3) Audio Anchor Mechanism: Anchors act as cross-modal hubs spaced at $\approx$80 ms intervals, allowing distinct text symptoms (e.g., "wheeze") to directly query transient acoustic events, thereby capturing fine-grained temporal structures without quadratic computational costs.
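
The anchor placement described in Figure 3 can be illustrated with a minimal sketch. It builds a 0/1 global-attention mask over a woven sequence of text tokens followed by audio frames, marking textual sentinels and evenly spaced audio anchors as global. The function name `build_global_attention_mask`, the sentinel position, and the 20 ms frame hop are assumptions made for illustration; only the roughly 80 ms anchor spacing comes from the caption, and this is not the authors' implementation.

```python
def build_global_attention_mask(n_text, n_audio, sentinel_positions=(0,),
                                frame_ms=20.0, anchor_ms=80.0):
    """Return a 0/1 list where 1 marks a token that receives global attention.

    Assumed layout (hypothetical): n_text text tokens first, then n_audio
    audio frames. frame_ms is an assumed frame hop; anchor_ms follows the
    ~80 ms spacing quoted in the Figure 3 caption.
    """
    if frame_ms <= 0 or anchor_ms <= 0:
        raise ValueError("frame_ms and anchor_ms must be positive")
    # Number of audio frames between consecutive anchors (at least 1).
    stride = max(1, round(anchor_ms / frame_ms))
    mask = [0] * (n_text + n_audio)
    # Textual sentinels: global tokens inside the text segment.
    for pos in sentinel_positions:
        if not 0 <= pos < n_text:
            raise ValueError("sentinel position outside the text segment")
        mask[pos] = 1
    # Audio anchors: every `stride`-th frame of the audio segment.
    for i in range(0, n_audio, stride):
        mask[n_text + i] = 1
    return mask


mask = build_global_attention_mask(n_text=6, n_audio=12)
global_positions = [i for i, m in enumerate(mask) if m]
print(global_positions)  # [0, 6, 10, 14]
```

With these assumed settings, one in four audio frames is global, so the number of global tokens grows linearly with sequence length while every other token attends only within its sliding window. A mask of this shape corresponds to what Longformer implementations typically accept as a per-token global-attention indicator, for example the `global_attention_mask` argument in Hugging Face Transformers.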