ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

Khanh Chi Le, Linghe Wang, Minhwa Lee, Ross Volkov, Luan Tuyen Chau, Dongyeop Kang

TL;DR

ScholaWrite is a first-of-its-kind keystroke corpus covering the end-to-end scholarly writing process for complete manuscripts, with thorough annotations of the cognitive writing intention behind each keystroke. It demonstrates the importance of collecting end-to-end writing data, rather than only the final manuscript, for developing future writing assistants that support the cognitive thinking process of scientists.

Abstract

Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks with different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions, which includes \LaTeX-based edits from five computer science preprints and captures nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.

Paper Structure

This paper contains 83 sections, 27 figures, 18 tables.

Figures (27)

  • Figure 1: An example scholarly writing process with annotated writing intents in ScholaWrite: it is iterative, non-linear, and switches frequently between multiple activities, tools, and intents over a long range of time.
  • Figure 2: Transition probability matrix between writing intentions. Each cell shows the likelihood that a session with the current intention (y-axis) is followed by a session with the next intention (x-axis).
  • Figure 3: The number of intentions per writing session.
  • Figure 4: Dynamics across early and late phases of writing. (a) The share of time devoted to each intention shifts from planning to revision as writing progresses. (b) Later sessions involve more overlapping intentions (blue for 1-2, orange for 3-5, and green for >5 intentions), reflecting higher cognitive integration.
  • Figure 5: Model alignment patterns. (a; minutes-vs-alignment) Longer writing sessions show lower alignment, indicating higher cognitive complexity. (b; #-intents-vs-alignment) Alignment decreases as more intentions intertwine within a session.
  • ...and 22 more figures
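
The transition probability matrix described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes, as a simplification, that each session carries a single intention label and that sessions are ordered in time; the function name `transition_matrix`, the toy labels, and the toy sequence are hypothetical and are not taken from the dataset or the paper's taxonomy. Each row is normalized so that it gives the likelihood of the next intention given the current one.

```python
from collections import Counter, defaultdict


def transition_matrix(intent_sequence):
    """Row-normalized first-order transition probabilities.

    matrix[current][nxt] estimates P(next session = nxt | this session = current)
    from counts of consecutive pairs in the sequence.
    """
    counts = defaultdict(Counter)
    # Count each (current, next) pair of consecutive sessions.
    for current, nxt in zip(intent_sequence, intent_sequence[1:]):
        counts[current][nxt] += 1

    labels = sorted(set(intent_sequence))
    matrix = {}
    for current in labels:
        total = sum(counts[current].values())
        # A label with no observed successor gets an all-zero row.
        matrix[current] = {
            nxt: (counts[current][nxt] / total if total else 0.0)
            for nxt in labels
        }
    return matrix


# Hypothetical toy sequence of per-session labels (assumption, not real data).
sessions = ["planning", "drafting", "drafting", "revising", "drafting", "revising"]
probs = transition_matrix(sessions)

for current, row in probs.items():
    print(current, row)
```

Rows correspond to the current intention (the y-axis in the caption) and columns to the next intention (the x-axis). Sessions with several overlapping intentions would need a different encoding, which this sketch does not attempt.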