RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

TL;DR

This paper introduces RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system.

Abstract

Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and answer formats. Existing biomedical audio-language QA systems are typically monolithic, with no specialization mechanism for handling diverse respiratory corpora and query intents. They are also validated only in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world use. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
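
The two-stage conditional specialization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random linear maps standing in for pre-trained audio encoders, the hard top-1 selection, the choice to feed the adapter router a concatenation of the audio prefix and a question embedding, and all names and sizes (`top1_route`, `answer_features`, `AUDIO_DIM`, and so on) are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, TEXT_DIM, HIDDEN, RANK = 16, 8, 4, 2


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def top1_route(features, router_weights):
    """Score every expert with a linear router and pick the best one.

    Hard top-1 selection is an assumption of this sketch.
    """
    probs = softmax(router_weights @ features)
    return int(np.argmax(probs)), probs


# Stage A: two stand-in audio "encoders" (random linear maps) and their router.
audio_experts = [rng.normal(size=(HIDDEN, AUDIO_DIM)) for _ in range(2)]
audio_router = rng.normal(size=(2, AUDIO_DIM))

# Stage B: one frozen base weight plus two low-rank LoRA updates (B @ A).
base_weight = rng.normal(size=(HIDDEN, HIDDEN))
frozen_copy = base_weight.copy()
lora_adapters = [
    (rng.normal(size=(HIDDEN, RANK)), rng.normal(size=(RANK, HIDDEN)))
    for _ in range(2)
]
adapter_router = rng.normal(size=(2, HIDDEN + TEXT_DIM))


def answer_features(audio, question):
    """Route the audio to an encoder, then route the example to an adapter."""
    expert_id, _ = top1_route(audio, audio_router)
    prefix = audio_experts[expert_id] @ audio  # selected audio prefix

    # Assumption: the adapter router sees both the prefix and the question.
    joint = np.concatenate([prefix, question])
    adapter_id, _ = top1_route(joint, adapter_router)

    lora_b, lora_a = lora_adapters[adapter_id]
    # The base weight is never modified; the low-rank update is added on the fly.
    hidden = (base_weight + lora_b @ lora_a) @ prefix
    return expert_id, adapter_id, hidden


audio = rng.normal(size=AUDIO_DIM)
question = rng.normal(size=TEXT_DIM)
expert_id, adapter_id, hidden = answer_features(audio, question)
print(f"audio expert={expert_id}, LoRA adapter={adapter_id}, hidden shape={hidden.shape}")
```

The point of the sketch is only the control flow: each example activates one encoder and one adapter, while the shared base weight stays frozen, which is why the parameter overhead of adding adapters remains small.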

Paper Structure (58 sections, 5 equations, 12 figures, 13 tables)

Figures (12)

  • Figure 1: Two-stage routing for RA-QA. (A) The MoE selects an audio expert (encoder). The resulting aligned audio embeddings are injected as a selected audio prefix. (B) The MoA selects a LoRA adapter for the language model during generation.
  • Figure 2: Tolerance accuracy for regression. Accuracy as a function of absolute-error tolerance $\epsilon$ for spirometry targets (FVC/FEV1 on MM-Lung) and respiratory rate (NoseMic), comparing a single-path baseline against two-stage routing. Two-stage routing reaches higher accuracy at tighter tolerances, indicating fewer large prediction errors.
  • Figure 3: Unified routing heatmap across datasets, question formats, and diagnosis categories. Columns are grouped into Datasets (green), Question types (red), Diagnosis labels (blue), and Tasks (orange), while rows correspond to the four experts (operaCT, operaGT, LoRA expert 1, LoRA expert 2).
  • Figure 4: Label distributions from the UK COVID-19 dataset for viral load categories (multiclass) and for binary labels covering symptoms such as runny or blocked nose and conditions such as asthma. As with the other datasets used in this work, these labels are highly imbalanced and require preprocessing and reduction strategies to ensure meaningful training and evaluation.
  • Figure 5: Label distributions from the CoughVid dataset, illustrating audio-related attributes such as cough type, presence of stridor, and associated diagnoses. Beyond clinical metadata, some datasets also include perceptual or acoustic labels (e.g., wheezes, stridor), which are directly linked to the audio signal and can support more fine-grained sound analysis.
  • ...and 7 more figures
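
The tolerance accuracy plotted in Figure 2 can be sketched as follows. The caption only says that accuracy is reported as a function of the absolute-error tolerance $\epsilon$; the exact definition used here (the fraction of predictions with $|\hat{y} - y| \le \epsilon$, with an inclusive bound) is an assumption, as are the function names and the toy values, which are not results from the paper.

```python
import numpy as np


def tolerance_accuracy(y_true, y_pred, eps):
    """Fraction of predictions whose absolute error is at most eps.

    The inclusive bound (<=) is an assumption of this sketch.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) <= eps))


def tolerance_curve(y_true, y_pred, tolerances):
    """Evaluate tolerance accuracy at each tolerance, e.g. to plot a curve."""
    return [tolerance_accuracy(y_true, y_pred, eps) for eps in tolerances]


# Toy targets and predictions (made up for illustration).
y_true = [3.0, 4.0, 2.5, 5.0]
y_pred = [3.1, 4.6, 2.5, 3.8]

curve = tolerance_curve(y_true, y_pred, [0.0, 0.25, 1.0, 2.0])
print(curve)  # [0.25, 0.5, 0.75, 1.0]
```

A curve that rises faster at small $\epsilon$ means more predictions fall close to the target, which is how the caption's "higher accuracy at tighter tolerances" should be read.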