Identifying the Desired Word Suggestion in Simultaneous Audio
Dylan Gaines, Keith Vertanen
TL;DR
This work studies audio-only word suggestions for non-visual text input by measuring how well listeners can detect and identify a target word spoken among simultaneous text-to-speech (TTS) voices. Two perceptual studies show that accuracy drops as more simultaneous words are added, but that a small delay between the start times of the voices substantially mitigates the drop, so that two-voice simultaneous playback approaches sequential accuracy while being faster. The key contributions are the first quantitative analysis of single-word discrimination with concurrent TTS, a comparison of spatial configurations, and the demonstration that a 0.15–0.25 s delay yields near-sequential accuracy with faster delivery of word suggestions. For eyes-free text entry, the findings support faster feedback with minimal loss in accuracy and inform design choices such as limiting playback to two simultaneous suggestions and applying a small delay to improve intelligibility.
Abstract
We explore a method for presenting word suggestions for non-visual text input using simultaneous voices. We conduct two perceptual studies and investigate the impact of different presentations of voices on a user's ability to detect which voice, if any, spoke their desired word. Our sets of words simulated the word suggestions of a predictive keyboard during real-world text input. We find that when voices are simultaneous, user accuracy decreases significantly with each added word suggestion. However, adding a slight 0.15 s delay between the start of each subsequent word allows two simultaneous words to be presented with no significant decrease in accuracy compared to presenting two words sequentially (84% simultaneous versus 86% sequential). This allows two word suggestions to be presented to the user 32% faster than sequential playback without decreasing accuracy.
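
To make the timing trade-off concrete, the following is a minimal sketch of how playback time could be compared between sequential presentation and simultaneous presentation with staggered start times. It is not the authors' implementation: the function names, the optional `gap` parameter, and the word durations in the usage example are illustrative assumptions, and only the 0.15 s delay is taken from the abstract.

```python
from typing import Sequence


def sequential_duration(durations: Sequence[float], gap: float = 0.0) -> float:
    """Total time when each word starts only after the previous one ends.

    `gap` is a hypothetical pause between words (not specified in the abstract).
    """
    if not durations:
        return 0.0
    return sum(durations) + gap * (len(durations) - 1)


def staggered_duration(durations: Sequence[float], delay: float = 0.15) -> float:
    """Total time when word i starts i * delay seconds after the first word.

    The voices overlap, so playback ends when the last-finishing word ends.
    """
    return max((i * delay + d for i, d in enumerate(durations)), default=0.0)


def time_saved_fraction(
    durations: Sequence[float], delay: float = 0.15, gap: float = 0.0
) -> float:
    """Fraction of sequential playback time saved by staggered playback."""
    seq = sequential_duration(durations, gap)
    if seq == 0.0:
        return 0.0
    return 1.0 - staggered_duration(durations, delay) / seq


if __name__ == "__main__":
    # Made-up durations (seconds) for two synthesized word suggestions.
    words = [0.5, 0.5]
    print(f"sequential: {sequential_duration(words):.2f} s")
    print(f"staggered:  {staggered_duration(words):.2f} s")
    print(f"time saved: {time_saved_fraction(words):.0%}")
```

The saving depends entirely on the word durations assumed, so this sketch is not expected to reproduce the 32% figure reported above. It only illustrates why overlapping two words with a short onset delay finishes sooner than playing them back to back.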
