Desirable Unfamiliarity: Insights from Eye Movements on Engagement and Readability of Dictation Interfaces

Zhaohui Liang, Yonglin Chen, Naser Al Madi, Can Liu

TL;DR

The paper addresses the challenge of making spoken transcripts readable without sacrificing real-time feedback in dictation interfaces. It employs an eye-tracking study with 20 participants across five interfaces (PLAIN, AOC, RAKE, GP-TSM, SUMMARY) to compare composition and review phases, using both quantitative gaze metrics and qualitative feedback. Key findings include that only 7–11% of production time is spent reading, abstractive LLM-generated summaries reduce reading effort and are preferred, while simple keyword highlighting (RAKE) can be more effective than grammar-preserving highlights; unfamiliar phrasing from summaries is tolerated when gist is preserved. The study highlights Desirable Unfamiliarity as a design principle, suggesting interfaces should favor gist representations and recall-facilitating highlights over verbatim fidelity, with implications for AI-assisted transcription in meetings, lectures, and beyond.

Abstract

Transcripts displayed on dictation interfaces can be hard to read due to recognition errors and disfluencies. LLM-based text auto-correction could help, but changing the text during production could lead to distraction and unintended phrasing. To understand how to balance readability, attention, and accuracy, we conducted an eye-tracking experiment with 20 participants to compare five dictation interfaces: PLAIN (real-time transcription), AOC (periodic corrections), RAKE (keyword highlights), GP-TSM (grammar-preserving highlights), and SUMMARY (LLM-generated abstractive summary). By analyzing participants' gaze patterns during speech composition and reviewing processes, we found that during composition, participants spent only 7–11% of their time in active reading regardless of the interface. Although SUMMARY introduced unfamiliar words and phrasing during composition, it was easier to read and more preferred by participants. Our findings suggest a high user tolerance for altering spoken words in LLM-enabled dictation interfaces.

Paper Structure

This paper contains 74 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Illustration of the mechanism of Accumulative Offline Correction (AOC). The interface displays raw transcripts in real time, while the text is periodically replaced by corrected text.
  • Figure 2: Experiment environment and procedure. Figure (a) shows the experiment setup, featuring a participant seated with a posture corrector, a microphone on the desk, and a 24-inch monitor with a Tobii Pro Spark eye-tracker capturing gaze data at 60 Hz. Figure (b) shows the procedure of the experiment tasks. There are five dictation tasks, each testing one Interface. Each dictation task begins with a training and preparation phase, followed by a data collection phase with two steps: speak and review (reading). Two external review (reading) tasks on GPT-generated text are inserted between dictation tasks in counterbalanced order.
  • Figure 3: Examples of drift correction where fixation positions are reattached to their original line of text. As shown in Figures (a) and (b), after drift correction, the drifted gaze points were adjusted, aligning the fixations to the text tokens. This is needed to accurately measure eye movement metrics over tokens of text.
  • Figure 4: Illustration of eye-movement engagement metrics. Figure (a) shows the distinction between fixations on text and fixations off text. Figure (b) shows the distinction between sustained reading, where fixations move from left to right in a typical linear reading order, and hopping, where the eyes jump across the text.
  • Figure 5: Gaze engagement during speech production and review across the five Interface conditions. (a) Average percentages of gaze points On Text (in grey) and Off Text (in green) during speech production. Two shades of green show the percentages of Sustained Reading and Hopping. (b) Detailed breakdown of On Text gaze points, distinguishing between Local and Distal relative to the position of production (cursor position). (c) Average percentages of gaze points On Text (in grey) and Off Text (in green) during reviewing.
  • ...and 6 more figures
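
The engagement categories described in the captions for Figures 4 and 5 (off text, sustained reading, hopping) can be illustrated with a small sketch. This is not the authors' implementation: the `Fixation` type, the `max_forward_px` threshold, and the rule "a forward move on the same line counts as sustained reading" are assumptions made here for illustration, and the paper's actual criteria may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fixation:
    x: float              # horizontal position in pixels
    line: Optional[int]   # text line the fixation maps to; None means off text


def label_fixations(fixations: List[Fixation],
                    max_forward_px: float = 200.0) -> List[str]:
    """Label each fixation as 'off_text', 'sustained', or 'hopping'.

    Assumed rule (hypothetical): an on-text fixation is 'sustained' when it
    follows another on-text fixation on the same line and moves rightward by
    at most max_forward_px; any other on-text fixation is 'hopping'.
    """
    labels: List[str] = []
    prev: Optional[Fixation] = None
    for fix in fixations:
        if fix.line is None:
            labels.append("off_text")
            prev = None  # an off-text fixation breaks the reading sequence
            continue
        forward = prev is not None and prev.line == fix.line \
            and 0 < fix.x - prev.x <= max_forward_px
        labels.append("sustained" if forward else "hopping")
        prev = fix
    return labels


def engagement_shares(labels: List[str]) -> Dict[str, float]:
    """Return the fraction of fixations in each category."""
    total = len(labels)
    if total == 0:
        return {}
    names = ("off_text", "sustained", "hopping")
    return {name: labels.count(name) / total for name in names}


# Toy data, not taken from the study.
fixations = [
    Fixation(100, 0), Fixation(180, 0), Fixation(260, 0),  # reading along line 0
    Fixation(50, 3),                                       # jump to line 3
    Fixation(0, None),                                     # look away from text
    Fixation(120, 3),                                      # return to text
]
labels = label_fixations(fixations)
shares = engagement_shares(labels)
```

Under this rule, the first fixation after any gap is counted as hopping because it has no on-text predecessor; a real analysis would also need drift-corrected fixations (as in Figure 3) before assigning lines.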