Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework

Zeinab Dehghani; Rameez Raja Kureshi; Koorosh Aslansefat; Faezeh Alsadat Abedi; Dhavalkumar Thakker; Lisa Greaves; Bhupesh Kumar Mishra; Baseer Ahmad; Tanaya Maslekar

Abstract

Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety-focused evaluation framework is presented that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety-critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with zero missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.

Paper Structure (30 sections, 11 equations, 5 figures, 6 tables)

Introduction
Literature Review
Use of Smart Speakers in Care
Evaluation of Smart Speaker Systems in Care Settings
System Overview and Architecture
End-to-end workflow
System components
Design choices supporting safety and reliability
Problem Definition and Evaluation Objectives
Conceptualising the system as a pipeline
Evaluation Framework and Metrics
Assurance-driven evaluation approach
Evaluation dimensions
Data integrity and retrieval metrics
Resident ID and category accuracy
...and 15 more sections

Figures (5)

Figure 1: System overview and architecture of the voice-enabled care support platform.
Figure 2: Assurance case for the Care Home Smart Speaker. The metrics-based argument A2 justifies the parsing, inserting, and scheduling risk C1.5 using three KPIs evaluated with 95% Wilson confidence intervals: C2.3 database insertion accuracy, C2.4 reminder scheduling success, and C2.5 absence of hallucinations during parsing. Retrieval risk (C1.1) is decomposed by A1 into standard and paraphrased queries, supported by evidence E2.1–E2.3 and defeater D2.2.
Figure 3: Per-category accuracy for GPT-5.2, showing category matching, resident ID matching, and reminder recognition across care categories.
Figure 4: Per-category accuracy for LLaMA-3, showing category matching, resident ID matching, and reminder recognition across care categories.
Figure 5: Per-category accuracy for Qwen, showing category matching, resident ID matching, and reminder recognition across care categories.
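
The Figure 2 caption states that the KPIs are evaluated with 95% Wilson confidence intervals. As a minimal sketch of how such an interval can be computed for a proportion, the function below implements the standard Wilson score formula. The function name `wilson_interval` and the example counts are illustrative assumptions, not taken from the paper; the denominators the authors used for each metric are not given here, so the output is not claimed to reproduce the reported intervals.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    Returns (lower, upper) as proportions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom

    # Clamp to guard against floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Illustrative counts only (assumption): every one of 330 trials correct.
low, high = wilson_interval(330, 330)
print(f"lower = {low:.4f}, upper = {high:.4f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays informative when the observed proportion is exactly 0 or 1, which matters for metrics reported at 100%: the lower bound remains strictly below 1 and tightens as the sample size grows.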
