Desirable Unfamiliarity: Insights from Eye Movements on Engagement and Readability of Dictation Interfaces
Zhaohui Liang, Yonglin Chen, Naser Al Madi, Can Liu
TL;DR
The paper addresses the challenge of making spoken transcripts readable without sacrificing real-time feedback in dictation interfaces. An eye-tracking study with 20 participants compares five interfaces (PLAIN, AOC, RAKE, GP-TSM, SUMMARY) across composition and review phases, using both quantitative gaze metrics and qualitative feedback. Three findings stand out: participants spend only 7–11% of their time during composition actively reading; abstractive LLM-generated summaries reduce reading effort and are preferred; and simple keyword highlighting (RAKE) can be more effective than grammar-preserving highlights (GP-TSM). Unfamiliar phrasing in summaries is tolerated when the gist is preserved. The study proposes Desirable Unfamiliarity as a design principle, suggesting that interfaces should favor gist representations and recall-facilitating highlights over verbatim fidelity, with implications for AI-assisted transcription in meetings, lectures, and beyond.
Abstract
Transcripts displayed on dictation interfaces can be hard to read due to recognition errors and disfluencies. LLM-based text auto-correction could help, but changing the text during production could lead to distraction and unintended phrasing. To understand how to balance readability, attention, and accuracy, we conducted an eye-tracking experiment with 20 participants to compare five dictation interfaces: PLAIN (real-time transcription), AOC (periodic corrections), RAKE (keyword highlights), GP-TSM (grammar-preserving highlights), and SUMMARY (LLM-generated abstractive summary). By analyzing participants' gaze patterns during speech composition and review, we found that during composition, participants spent only 7–11% of their time in active reading, regardless of the interface. Although SUMMARY introduced unfamiliar words and phrasing during composition, it was easier to read and was preferred by participants. Our findings suggest a high user tolerance for altering spoken words in LLM-enabled dictation interfaces.
