Identifying the Desired Word Suggestion in Simultaneous Audio

Dylan Gaines; Keith Vertanen

TL;DR

This work addresses non-visual, audio-only word suggestions by evaluating the ability to detect and identify a target word spoken among simultaneous TTS voices. Through two perceptual studies, it shows that simultaneous presentation degrades accuracy as more words are added, but introducing a small start-time delay between voices substantially mitigates this, enabling two-voice simultaneous playback to approach sequential accuracy while increasing speed. The key contributions are the first quantitative analysis of single-word discrimination with concurrent TTS, a comparison of spatial configurations, and the demonstration that a 0.15–0.25 s delay yields near-sequential performance and faster word suggestion delivery. The findings have practical implications for eyes-free text entry by enabling faster feedback with minimal loss in accuracy, informing design choices like limiting to two simultaneous suggestions and applying small delays to improve intelligibility.

Abstract

We explore a method for presenting word suggestions for non-visual text input using simultaneous voices. We conduct two perceptual studies and investigate the impact of different presentations of voices on a user's ability to detect which voice, if any, spoke their desired word. Our sets of words simulated the word suggestions of a predictive keyboard during real-world text input. We find that when voices are simultaneous, user accuracy decreases significantly with each added word suggestion. However, adding a slight 0.15 s delay between the start of each subsequent word allows two simultaneous words to be presented with no significant decrease in accuracy compared to presenting two words sequentially (84% simultaneous versus 86% sequential). This allows two word suggestions to be presented to the user 32% faster than sequential playback without decreasing accuracy.
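
The timing benefit of staggered onsets can be illustrated with a small sketch. It assumes each later word starts a fixed delay after the previous one and that playback ends when the last voice finishes. The function names and the 0.5 s word durations are hypothetical and are not taken from the paper, so the result is not expected to reproduce the reported 32% figure, which depends on the actual recorded word lengths.

```python
def sequential_duration(durations):
    """Total time when words are played one after another (no gaps assumed)."""
    return sum(durations)


def staggered_duration(durations, delay):
    """Total time when word i starts i * delay seconds after the first word.

    Playback ends when the last-finishing voice ends.
    """
    if not durations:
        raise ValueError("at least one word duration is required")
    return max(i * delay + d for i, d in enumerate(durations))


def time_saved_fraction(durations, delay):
    """Fraction of the sequential playback time saved by staggering onsets."""
    seq = sequential_duration(durations)
    return (seq - staggered_duration(durations, delay)) / seq


# Hypothetical example: two words of 0.5 s each with a 0.15 s onset delay.
words = [0.5, 0.5]
print(round(sequential_duration(words), 2))
print(round(staggered_duration(words, 0.15), 2))
print(round(time_saved_fraction(words, 0.15), 2))
```

With longer words the second onset falls earlier relative to the total length, so the same 0.15 s delay saves a larger share of the playback time.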

Paper Structure (12 sections, 4 figures, 3 tables)

This paper contains 12 sections, 4 figures, 3 tables.

Introduction
Related Work
Study 1
Procedure
Results
Discussion
Study 2
Procedure
Results
Discussion
Limitations
Conclusion

Figures (4)

Figure 1: The web interface at the beginning of the Sequential condition. The text below the Voice buttons and the Continue button did not appear until all four Voice buttons had been clicked. This screen was meant to familiarize users with which voice corresponds to which number.
Figure 2: The web interface once the participant had clicked 'Ready' and the audio had played. The participant now selects which voice they heard say the target word.
Figure 3: The average accuracy of participants in Study 1. Error bars represent standard error of the mean.
Figure 4: The average accuracy of participants in Study 2. Error bars represent standard error of the mean.
