Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults
Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit
TL;DR
The paper addresses the reliability of state-of-the-art ASR for older adults in clinical chatbot contexts and evaluates generic multilingual models against elder-focused Dutch fine-tunes using Welzijn.AI data and Common Voice benchmarks. It shows that generic models generally outperform fine-tuned variants, with a truncated large model (Whisper-large-v3-turbo) offering the best balance between accuracy and processing speed for real-time interaction. The findings suggest that out-of-the-box large multilingual ASR can robustly support geriatric chatbots, though inputs with a high word error rate (WER) still pose trust and usability risks, highlighting the need for domain-aware mitigation. Limitations include a small sample and privacy constraints, underscoring the need for larger-scale validation before widespread deployment.
Abstract
Voice-controlled interfaces, with chatbots being a prime example, can support older adults in clinical contexts, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on the language use of older Dutch adults who interacted with Welzijn.AI, a chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests that recent ASR models can generalise well out of the box to real-world datasets. Moreover, our results indicate that truncating generic models helps balance the accuracy-speed trade-off. Nonetheless, we also find inputs that cause a high word error rate, and we place them in context.
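For readers unfamiliar with the metric, word error rate is conventionally defined as the number of word-level substitutions, deletions and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the evaluation code used in the study. The function name `word_error_rate`, the lowercasing and whitespace tokenisation, and the Dutch example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalised only by lowercasing and splitting on
    whitespace; a real evaluation would likely also handle punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions are needed to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("me" -> "mij") and one deletion
# ("vandaag") against a five-word reference.
wer = word_error_rate("ik voel me goed vandaag", "ik voel mij goed")
print(f"WER = {wer:.2f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words, which is one reason individual high-WER inputs deserve inspection in context.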
