
UTI-LLM: A Personalized Articulatory-Speech Therapy Assistance System Based on Multimodal Large Language Model

Yudong Yang, Xiaokang Liu, Shaofeng Zhao, Rongfeng Su, Nan Yan, Lan Wang

TL;DR

This work tackles the real-time articulatory feedback gap in speech rehabilitation by introducing UTI-LLM, a multimodal large language model that fuses Ultrasound Tongue Imaging (UTI) with speech signals. It builds a domain-specific 10k-scale UTI–speech dialogue dataset and a Temporal-Spatial Dynamic Understanding framework to enable fine-grained tongue-movement analysis and actionable therapy feedback. The architecture combines a speech encoder (HuBERT), a UTI visual encoder (CLIP ViT-L/14), and an instruction-tuned LLM (Qwen2.5-7B) to reason over normalized tongue trajectories $\widetilde{\mathcal{T}}$, phonetic content, and diagnostic labels via $R = f_{LLM}(\langle Instruction\rangle, \langle Q_{UTI}\rangle, \langle Q_{Speech}\rangle)$. Experimental results on the AUSpeech dataset show improved tongue-movement description quality, higher dysarthria assessment accuracy and F1, and stronger expert-validated usefulness, underscoring the practical impact of cross-modal alignment for personalized speech rehabilitation.
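As a rough illustration of the input assembly $R = f_{LLM}(\langle Instruction\rangle, \langle Q_{UTI}\rangle, \langle Q_{Speech}\rangle)$, the sketch below projects per-frame UTI and speech features into a shared embedding width and concatenates them after the instruction tokens. It is a toy under stated assumptions, not the authors' implementation: the linear projection, the ordering of the three segments, the tiny dimensions, and the names `project` and `build_llm_input` are all hypothetical, and the real encoders (HuBERT, CLIP ViT-L/14) and the LLM are replaced by hand-written numbers.

```python
from typing import List, Sequence

Vector = List[float]


def project(features: Sequence[Vector], weight: Sequence[Vector]) -> List[Vector]:
    """Map each feature vector into the LLM embedding space (features @ weight).

    `weight` has shape (in_dim, out_dim). A plain linear map is an assumption;
    the summary does not say which connector the paper uses.
    """
    columns = list(zip(*weight))  # out_dim columns, each of length in_dim
    return [[sum(f * w for f, w in zip(feat, col)) for col in columns]
            for feat in features]


def build_llm_input(instruction: Sequence[Vector],
                    q_uti: Sequence[Vector],
                    q_speech: Sequence[Vector]) -> List[Vector]:
    """Concatenate <Instruction>, <Q_UTI>, <Q_Speech> along the sequence axis."""
    dims = {len(v) for seq in (instruction, q_uti, q_speech) for v in seq}
    if len(dims) != 1:
        raise ValueError(f"all segments must share one embedding width, got {dims}")
    return list(instruction) + list(q_uti) + list(q_speech)


# Toy stand-ins: 2 instruction tokens (width 3), 4 UTI frames (width 2),
# 3 speech frames (width 4). All values are made up for illustration.
instruction_embeds = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
uti_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
speech_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

uti_weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]            # (2, 3)
speech_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]          # (4, 3)

sequence = build_llm_input(
    instruction_embeds,
    project(uti_features, uti_weight),
    project(speech_features, speech_weight),
)
print(len(sequence), "tokens of width", len(sequence[0]))
```

In this toy setup the LLM would then generate the response $R$ from `sequence`; the width check simply guards against concatenating segments that were projected to different sizes.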

Abstract

Speech therapy is essential for rehabilitating speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback. Recent advances in multimodal large language models (MLLMs) have demonstrated significant potential in healthcare, especially through their adaptive assessment and therapeutic feedback capabilities. Nevertheless, challenges including insufficient acquisition and fusion of articulatory information, inadequate parsing of articulatory organ motion trajectories, and the scarcity of domain-specific datasets hinder the application of MLLMs in speech therapy. To address these limitations, we propose an MLLM-based speech rehabilitation assistance system that leverages ultrasound tongue imaging and speech signals to deliver precise, interactive articulatory feedback. We construct a high-quality domain-specific dataset comprising ultrasound-speech dialogue pairs. This dataset facilitates fine-tuning to enhance the model's clinical adaptability. Furthermore, our method develops a spatiotemporal fusion training strategy for ultrasound videos and speech signals, enabling fine-grained articulatory impairment analysis and ultimately generating actionable feedback. Experimental results demonstrate the effectiveness of our model in articulatory analysis and clinical assessment.

Paper Structure

This paper contains 11 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Agent QA Framework for Medical Dialogue Generation. The schematic illustrates the interactive pipeline of our datasets, which simulates real-world clinical consultations.
  • Figure 2: Architecture of UTI-LLM Multimodal Speech Rehabilitation. The architecture enables joint reasoning over articulatory movement, acoustic and UTI visual cues for speech rehabilitation.
  • Figure 3: Ablation experiments with different module configurations, demonstrating the necessity of both the UTI and speech inputs.