Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei Zhang; Tianxin Xie; Minghao Yang; Li Liu

TL;DR

Resp-Agent tackles the dual challenges of information loss and data scarcity in respiratory-sound analysis by uniting diagnostic reasoning with controllable audio synthesis in a closed-loop, agent-based framework. It introduces Resp-229k, a large cross-domain benchmark with LLM-generated clinical narratives, to support multimodal learning and evaluation under strict domain shift. The Thinker-A$^2$CA planner coordinates a Generator (discrete-unit planning and Flow Matching) and a Diagnoser (modality weaving with Strategic Global Attention and audio anchors). Together they yield strong gains in minority-class accuracy and macro-F1, and targeted synthesis is used to rebalance the data. Empirically, Resp-Agent improves diagnostic robustness on cross-domain data, and content-aware synthesis substantially outperforms naive augmentation, which points to practical potential for data-efficient, edge-aware medical-audio AI and clinician-in-the-loop systems. Code and data are released to support reproducibility under privacy-preserving constraints.

Abstract

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
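The closed loop described in the abstract (find diagnostic weaknesses, then schedule targeted synthesis) can be illustrated with a small sketch. The abstract does not state the scheduling rule used by Thinker-A$^2$CA, so the proportional-to-error allocation below is an assumption made for illustration only. The function names `per_class_error` and `allocate_synthesis_budget` and the class labels are likewise hypothetical.

```python
from collections import Counter


def per_class_error(y_true, y_pred):
    """Return {label: error rate}, grouped by the true label."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: wrong[label] / totals[label] for label in totals}


def allocate_synthesis_budget(error_rates, budget):
    """Split `budget` synthetic samples across classes in proportion to error.

    Assumption: a simple proportional rule, not the paper's actual policy.
    """
    total_error = sum(error_rates.values())
    if total_error == 0:
        return {label: 0 for label in error_rates}
    # Floor each proportional share, then hand the remainder to the
    # classes with the highest error rates.
    shares = {
        label: int(budget * err / total_error)
        for label, err in error_rates.items()
    }
    leftover = budget - sum(shares.values())
    ranked = sorted(error_rates, key=error_rates.get, reverse=True)
    for label in ranked[:leftover]:
        shares[label] += 1
    return shares


# Toy validation set: a long-tailed label distribution (hypothetical labels).
y_true = ["healthy"] * 6 + ["copd"] * 3 + ["pneumonia"]
y_pred = ["healthy"] * 6 + ["copd", "copd", "healthy", "copd"]

errors = per_class_error(y_true, y_pred)
shares = allocate_synthesis_budget(errors, budget=10)
print(shares)  # {'healthy': 0, 'copd': 2, 'pneumonia': 8}
```

In this toy run the rarest class is misclassified most often and therefore receives most of the synthesis budget. This is the rebalancing effect the closed loop aims for. A real controller would also need to account for confidence calibration and compute cost.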

Paper Structure

This paper contains 39 sections, 8 equations, 3 figures, and 16 tables.

Introduction
Related Work
Resp-229k: A Large-Scale, Multi-Source, Cross-Domain Benchmark
Resp-Agent: An LLM-Orchestrated Loop for Unified Diagnosis and Controllable Synthesis
Generator: Discrete-Unit Planning and CFM Reconstruction
Style-Conditioned Unit Modeling with a Retooled LLM
Conditional Flow Matching for High-Fidelity Waveforms
Diagnoser: Modality Weaving with Strategic Global Attention
Input-Level Modality Weaving
Strategic Global Attention
Experiments
Main diagnostic performance.
Conclusion
Data Provenance and Privacy.
Licensing and Compliance.
...and 24 more sections

Figures (3)

Figure 1: Overview of Resp-Agent. The framework functions as a closed-loop system composed of three interacting modules: (a) Thinker: A compute-aware planner (Thinker-A$^2$CA) that parses semantic intents and routes tasks to other agents based on recycled error profiles and calibrated confidence. (b) Generator: A synthesis module utilizing modality injection to condition the Resp-MLLM on both textual diagnosis and reference acoustic style, decoding discrete units via conditional flow matching. (c) Diagnoser: A clinical inference module employing modality weaving to fuse EHR summaries with audio features early in the network, leveraging sparse global attention for robust cross-modal reasoning.
Figure 2: Detailed architecture of Resp-MLLM (Stage 1 of the Generator). The model functions as a style-conditioned multimodal unit generator. Top: A modality injection mechanism fuses textual diagnosis semantics with acoustic style embeddings (projected from temporally pooled BEATs features) to prompt the Qwen3-0.6B-Base backbone. Bottom: A leak-free conditioning strategy is employed during training: random mask sampling ($\mathcal{M} \approx 10\%$) prevents the model from peeking at oracle tokens, ensuring robust autoregressive prediction of discrete acoustic units.
Figure 3: Diagnoser Architecture: Modality Weaving with Strategic Global Attention. The framework comprises three key mechanisms: (1) Input-Level Modality Weaving: EHR text tokens and projected audio features (extracted via BEATs) are fused into a single token-aligned sequence at the input layer, enabling native cross-modal interaction. (2) Strategic Global Attention: A Longformer backbone combines efficient sliding-window attention with a sparse set of global tokens, comprising textual sentinels and distributed Audio Anchors, to model long-range dependencies with linear complexity. (3) Audio Anchor Mechanism: Anchors act as cross-modal hubs spaced at $\approx$80 ms intervals, allowing distinct text symptoms (e.g., "wheeze") to directly query transient acoustic events, thereby capturing fine-grained temporal structure without quadratic computational cost.
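The global-token layout described in Figure 3 can be sketched as a 0/1 mask over the woven sequence, where 1 marks a token that attends globally. This is a minimal illustration under stated assumptions, not the authors' implementation: text tokens are assumed to precede audio tokens, the only textual sentinel is assumed to sit at position 0, and each audio token is assumed to span 20 ms, so an 80 ms spacing gives a stride of 4. The function name and all parameter names are hypothetical.

```python
def build_global_attention_mask(
    num_text_tokens,
    num_audio_tokens,
    sentinel_positions=(0,),  # assumption: one sentinel at the start of the text
    token_ms=20.0,            # assumption: duration covered by one audio token
    anchor_ms=80.0,           # anchor spacing taken from the figure caption
):
    """Return a 0/1 list over a [text | audio] sequence; 1 = global token."""
    stride = max(1, round(anchor_ms / token_ms))
    mask = [0] * (num_text_tokens + num_audio_tokens)

    # Textual sentinels attend globally.
    for pos in sentinel_positions:
        if 0 <= pos < num_text_tokens:
            mask[pos] = 1

    # Audio anchors: every `stride`-th audio token attends globally.
    for i in range(0, num_audio_tokens, stride):
        mask[num_text_tokens + i] = 1
    return mask


mask = build_global_attention_mask(num_text_tokens=4, num_audio_tokens=10)
print(mask)       # [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
print(sum(mask))  # 4 global tokens out of 14
```

Because the number of global tokens grows as roughly one per `stride` audio tokens plus a fixed number of sentinels, the attention cost stays linear in sequence length for a fixed window size. A mask of this shape is the kind of input that Longformer implementations such as the Hugging Face one accept as `global_attention_mask`.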
