Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding

Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour, Aaqib Saeed

TL;DR

The paper tackles the lack of clinical semantics in pre-trained medical audio models by introducing AcuLa, a post-training framework that treats a frozen medical LLM as a semantic teacher to ground audio representations. It implements a lightweight, preservation-focused teacher-student setup with two projection heads and optimizes a dual objective combining CKA-based semantic alignment with a self-supervised acoustic loss, aided by 100k+ synthetic audio–text pairs generated from metadata. Empirically, AcuLa achieves state-of-the-art results across 18 cardio-respiratory tasks from 10 datasets, boosting mean AUROC on classification benchmarks from 0.68 to 0.79 and raising COVID-19 cough detection AUROC from 0.55 to 0.89, while remaining model-agnostic. The work shows that, with a semantic teacher, audio encoders can internalize clinically meaningful concepts, enabling more accurate and robust health monitoring from acoustic signals.
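To make the CKA-based alignment term concrete, the following is a minimal sketch of linear CKA (centered kernel alignment) between a batch of audio features and a batch of text features, combined with an acoustic loss into one scalar. It is an illustration under stated assumptions, not the paper's implementation: the function names `linear_cka` and `dual_objective`, the use of the linear rather than a kernel variant of CKA, and the `weight` parameter are all hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with the same number of rows.

    x has shape (n, d1) and y has shape (n, d2); row i of each matrix is
    assumed to describe the same audio-text pair.
    """
    # Center each feature dimension across the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def dual_objective(audio_feats: np.ndarray,
                   text_feats: np.ndarray,
                   acoustic_loss: float,
                   weight: float = 1.0) -> float:
    """Hypothetical combination: (1 - CKA) alignment term plus a weighted acoustic term."""
    alignment_loss = 1.0 - linear_cka(audio_feats, text_feats)
    return alignment_loss + weight * acoustic_loss
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either feature set, so the two embedding spaces can be compared even when they have different dimensions. That is one plausible reason to prefer it over a raw distance when the teacher is frozen.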

Abstract

Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alignment), a lightweight post-training framework that instills semantic understanding into any audio encoder by aligning it with a medical language model, which acts as a "semantic teacher." To enable alignment at scale, we construct a large-scale dataset by leveraging off-the-shelf large language models to translate the rich, structured metadata accompanying existing audio recordings into coherent clinical reports. Our alignment strategy combines a representation-level contrastive objective with a self-supervised modeling objective, ensuring that the model learns clinical semantics while preserving fine-grained temporal cues. AcuLa achieves state-of-the-art results across 18 diverse cardio-respiratory tasks from 10 different datasets, improving the mean AUROC on classification benchmarks from 0.68 to 0.79 and, on the most challenging COVID-19 cough detection task, boosting the AUROC from 0.55 to 0.89. Our work demonstrates that this audio-language alignment transforms purely acoustic models into clinically aware diagnostic tools, establishing a novel paradigm for enhancing physiological understanding in audio-based health monitoring.

Paper Structure

This paper contains 25 sections, 5 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Performance comparison of audio-based models. (a) Average AUROC for respiratory classification tasks (T1-T9). (b) Average MAE for lung function estimation tasks (T10-T16). Our model (pink) outperforms all baselines, achieving the highest AUROC (0.79) and lowest MAE (0.82).
  • Figure 2: Architecture of the audio-language alignment framework. (A) Audio encoders extract features from clinical recordings, which are aligned with language representations via similarity matching. (B) Downstream tasks enabled by the aligned model, including (i) respiratory-health classification (9 tasks), (ii) cardiac-condition detection (2 tasks) and (iii) lung-function estimation (7 tasks).
  • Figure 3: Spectrograms of cardiopulmonary sounds with paired clinical reports. (a) Rhonchi showing continuous adventitious sounds from airway obstruction. (b) Holosystolic murmur indicating mitral valve pathology. (c) Normal breath sounds with clear pulmonary function. (d) Wheezes revealing airway constriction associated with asthma or COPD.
  • Figure 4: Top-3 clinical reports retrieved for auscultation clips. Left: query spectrogram and reference report. Right: three closest matches returned by our audio–text model.
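
The retrieval shown in Figure 4 can be sketched as a nearest-neighbour search in a shared embedding space. The snippet below is a minimal illustration, assuming the audio clip and the candidate reports have already been embedded as vectors and that cosine similarity is the ranking score. The function name `top_k_reports` and the choice of cosine similarity are assumptions, not details confirmed by the paper.

```python
import numpy as np


def top_k_reports(audio_emb: np.ndarray,
                  report_embs: np.ndarray,
                  k: int = 3) -> list[tuple[int, float]]:
    """Return (report index, cosine similarity) for the k closest reports.

    audio_emb has shape (d,); report_embs has shape (n, d).
    """
    # L2-normalize so the dot product equals cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb)
    r = report_embs / np.linalg.norm(report_embs, axis=1, keepdims=True)
    sims = r @ a
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy usage with 2-D embeddings (values are illustrative only).
reports = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])
matches = top_k_reports(query, reports, k=3)
```

With `k=3` this mirrors the top-3 layout of the figure: the query clip on one side and its three highest-scoring reports on the other.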