VoiceAlign: A Shimming Layer for Enhancing the Usability of Legacy Voice User Interface Systems

Md Ehtesham-Ul-Haque, Syed Masum Billah

TL;DR

VoiceAlign is an adaptive shimming layer that mediates between users and legacy VUI systems. It shows how modern AI techniques can unlock the underutilized potential of these systems without requiring system modifications.

Abstract

Voice user interfaces (VUIs) are rapidly transitioning from accessibility features to mainstream interaction modalities. Yet most operating systems' built-in voice commands remain underutilized despite possessing robust technical capabilities. Through our analysis of four commercial VUI systems and a formative study with 16 participants, we found that fixed command formats require exact phrasing, restrictive timeout mechanisms discard input during planning pauses, and insufficient feedback hampers multi-step interactions. To address these challenges, we developed VoiceAlign, an adaptive shimming layer that mediates between users and legacy VUI systems. VoiceAlign intercepts natural voice commands, transforms them to match the required syntax using a large language model, and transmits these adapted commands through a virtual audio channel that remains transparent to the underlying system. In our evaluation with 12 participants, VoiceAlign reduced command failures by half, required 25% fewer commands per task, and significantly lowered cognitive and temporal demands when paired with an existing legacy VUI system. Furthermore, we created a synthetic dataset informed by our studies and fine-tuned a small language model that achieves over 90% accuracy with 200 ms response time when served locally, eliminating dependence on third-party APIs while enabling real-time interaction on edge devices. This work demonstrates how modern AI techniques can unlock the underutilized potential of legacy VUI systems without requiring system modifications, offering a practical solution without replacing existing infrastructure.
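The intercept, transform, and transmit flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the verb list, the filler-word list, and the `rule_based_rewrite`, `shim`, and `transmit` names are hypothetical, and the rule-based rewriter merely stands in for the language model the paper uses. It is not the authors' implementation.

```python
import re
from typing import Callable, List, Optional

# Hypothetical subset of verbs the legacy VUI is assumed to accept.
KNOWN_VERBS = {"select", "delete", "copy"}
# Hypothetical filler words that a natural utterance may contain.
FILLERS = {"please", "the", "word", "a", "an", "could", "you"}


def rule_based_rewrite(utterance: str) -> Optional[str]:
    """Stand-in for the language model: drop fillers, keep verb and target.

    Returns None when the utterance cannot be mapped to a known verb.
    """
    tokens = re.findall(r"[a-z0-9']+", utterance.lower())
    kept = [t for t in tokens if t not in FILLERS]
    if not kept or kept[0] not in KNOWN_VERBS:
        return None
    return " ".join(kept)


def shim(
    utterance: str,
    rewrite: Callable[[str], Optional[str]],
    transmit: Callable[[str], None],
) -> bool:
    """Intercept an utterance, rewrite it, and forward it to the legacy VUI.

    `transmit` represents the virtual audio channel; here it is any callable.
    Returns True if a command was forwarded, False otherwise.
    """
    command = rewrite(utterance)
    if command is None:
        return False
    transmit(command)
    return True


# Usage: collect forwarded commands in a list instead of playing audio.
sent: List[str] = []
shim("Select the word Apple", rule_based_rewrite, sent.append)
shim("Hmm, what was I doing", rule_based_rewrite, sent.append)
print(sent)
```

Passing the rewriter and the transmitter as callables keeps the shim independent of both the model and the audio backend, which mirrors the claim that the underlying system needs no modification.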

Paper Structure (54 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: The fully quantified command template with four components and a sampler of available values for each component. An icon indicates the timer that specifies the threshold between two utterances to be considered part of the same instruction. Note that the timer applies to both inter- and intra-component utterances.
  • Figure 2: Six valid combinations of the four components with example commands for each combination.
  • Figure 3: Study setup showing (a) the physical arrangement with the participant seated before the laptop running Voice Control and (b) the interface displaying a sample correction task with target text and prompt.
  • Figure 4: Common instances of command sequences used by participants to accomplish corrections by insertions on the left and the workflow of the system on the right. An asterisk (*) indicates the optimal command sequences. The target text and the current state are shown at the top right. An underline is used to indicate where editing occurred. A square brace ([]) indicates a temporary buffer the system uses internally (e.g., all potential insertions until disambiguated).
  • Figure 5: VoiceAlign interface containing an indicator when the microphone is active, an input box to display the transcript of users' uttered commands in real-time, and the output from the LLM providing the correct command or a list of suggestions. (Left) An example of a command correction, where 'Select the word Apple' is transformed to the syntactically valid 'select apple' by removing extraneous words while preserving core components. (Right) An example of a command suggestion, where 'Insert the word Apple' lacks required context parameters, prompting the system to offer structured guidance on how to complete the command.
  • ...and 3 more figures
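
The timer in the Figure 1 caption, which decides whether two utterances belong to the same instruction, can be sketched as a simple gap test. This is an illustrative sketch only: the `group_utterances` name, the `(start, end, text)` tuple layout, and the 2.0 second threshold are assumptions, since the listing above does not state the actual threshold or data format.

```python
from typing import List, Tuple

# Each utterance is (start_seconds, end_seconds, text); a hypothetical layout.
Utterance = Tuple[float, float, str]


def group_utterances(utterances: List[Utterance], threshold_s: float) -> List[str]:
    """Merge consecutive utterances into one instruction when the silent gap
    between them does not exceed threshold_s. Input must be sorted by start."""
    groups: List[str] = []
    current: List[str] = []
    last_end = None
    for start, end, text in utterances:
        # A gap longer than the threshold closes the current instruction.
        if last_end is not None and start - last_end > threshold_s:
            groups.append(" ".join(current))
            current = []
        current.append(text)
        last_end = end
    if current:
        groups.append(" ".join(current))
    return groups


# Usage with an assumed 2.0 s threshold.
timeline = [
    (0.0, 0.5, "select"),
    (1.0, 1.4, "apple"),
    (5.0, 5.5, "delete"),
    (5.8, 6.0, "that"),
]
print(group_utterances(timeline, threshold_s=2.0))
```

A short threshold splits an instruction whenever the speaker pauses to plan, which is the failure mode the abstract attributes to restrictive timeouts.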