Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults

Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit

TL;DR

The paper addresses the reliability of state-of-the-art ASR for older adults in clinical chatbot contexts and evaluates generic multilingual models against elder-focused Dutch fine-tunes using Welzijn.AI data and Common Voice benchmarks. It shows that generic models generally outperform fine-tuned variants, with a truncated large model (Whisper-large-v3-turbo) offering the best balance between accuracy and processing speed for real-time interaction. The findings suggest that out-of-the-box large multilingual ASR can robustly support geriatric chatbots, though inputs with high WER still pose trust and usability risks, highlighting the need for domain-aware mitigation. Limitations include a small sample and privacy constraints, underscoring the need for larger-scale validation before widespread deployment.

Abstract

Voice-controlled interfaces can support older adults in clinical contexts -- with chatbots being a prime example -- but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to real-world datasets. Moreover, our results indicate that truncating generic models is helpful in balancing the accuracy-speed trade-off. Nonetheless, we also find inputs which cause a high word error rate and place them in context.
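
For readers unfamiliar with the metric, word error rate (WER) is commonly computed as the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not taken from the paper. The function name, the lowercase-and-split normalisation, and the Dutch example sentence are assumptions made for illustration, and the authors' actual text normalisation may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalisation here (lowercasing, whitespace split) is an assumption;
    real evaluations often also strip punctuation and normalise numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of five reference words is dropped.
print(word_error_rate("ik voel me vandaag goed", "ik voel me goed"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason individual high-WER inputs are worth inspecting in context rather than only in aggregate.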

Paper Structure

This paper contains 7 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Interface of Welzijn.AI with an example conversation. Users press the purple button to activate the ASR functionality and start responding, after which their speech is transcribed and rendered on the screen. Chatbot responses are read out with a text-to-speech model. The 'Scores' button shows information extracted on quality of life and frailty, 'Settings' allows choosing different ASR models, and 'Appearance' returns the user to the conversation on display. We focus in this paper on the conversation resulting from interaction with this prototype.
  • Figure 2: Overview of accuracy vs. processing time.