CorpusStudio: Surfacing Emergent Patterns in a Corpus of Prior Work while Writing

Hai Dang, Chelse Swoopes, Daniel Buschek, Elena L. Glassman

TL;DR

CorpusStudio tackles the challenge of externalizing implicit community writing norms by surfacing emergent patterns from a corpus of prior work during writing. It introduces two core concepts: a document-level ordered distribution over section titles and a sentence-level retrieval of analogous examples with highlighting to reveal norms and outliers, all integrated into a single writing interface. In a controlled study with 16 participants, users drafting outlines and manuscript sections reported increased confidence, better alignment with target venues, and meaningful engagement with multiple exemplars, while safeguards against plagiarism were valued and effective. The work argues for a community-centered design that emphasizes learning and transparency over automated text generation, offering a practical approach to onboarding new members and preserving scholarly provenance. Overall, CorpusStudio demonstrates how writing-support tools can surface structural and stylistic norms to improve scientific communication and learning within a given community.

Abstract

Many communities, including the scientific community, develop implicit writing norms. Understanding them is crucial for effective communication with that community. Writers gradually develop an implicit understanding of norms by reading papers and receiving feedback on their writing. However, it is difficult to both externalize this knowledge and apply it to one's own writing. We propose two new writing support concepts that reify document and sentence-level patterns in a given text corpus: (1) an ordered distribution over section titles and (2) given the user's draft and cursor location, many retrieved contextually relevant sentences. Recurring words in the latter are algorithmically highlighted to help users see any emergent norms. Study results (N=16) show that participants revised the structure and content using these concepts, gaining confidence in aligning with or breaking norms after reviewing many examples. These results demonstrate the value of reifying distributions over other authors' writing choices during the writing process.
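To make the first concept concrete, here is a minimal sketch of one way an ordered distribution over section titles could be computed. The abstract does not specify the computation, so everything here is an assumption for illustration: the function name `section_title_distribution`, the lowercase normalization of titles, the use of mean relative position for ordering, and the toy corpus.

```python
from collections import Counter, defaultdict


def section_title_distribution(corpus):
    """Summarize section titles across a corpus.

    corpus: list of papers, each a list of section titles in document order.
    Returns a list of (title, share_of_papers, mean_relative_position),
    ordered by where the title typically appears in a paper.
    """
    paper_count = Counter()
    positions = defaultdict(list)
    for titles in corpus:
        seen = set()
        for index, title in enumerate(titles):
            key = title.strip().lower()  # assumed normalization
            if key in seen:
                continue  # count each title once per paper
            seen.add(key)
            paper_count[key] += 1
            # Relative position in [0, 1]: 0 is the first section, 1 the last.
            rel = index / (len(titles) - 1) if len(titles) > 1 else 0.0
            positions[key].append(rel)
    total = len(corpus)
    rows = [
        (title, paper_count[title] / total, sum(pos) / len(pos))
        for title, pos in positions.items()
    ]
    # Order by typical position; break ties by popularity, then alphabetically.
    rows.sort(key=lambda row: (row[2], -row[1], row[0]))
    return rows


# Hypothetical toy corpus, not data from the paper.
toy_corpus = [
    ["Introduction", "Related Work", "Method", "Evaluation", "Discussion", "Conclusion"],
    ["Introduction", "Related Work", "Formative Study", "System", "Evaluation", "Conclusion"],
    ["Introduction", "Background", "Method", "Results", "Discussion", "Conclusion"],
]

for title, share, position in section_title_distribution(toy_corpus):
    print(f"{position:.2f}  {share:.0%}  {title}")
```

Read top to bottom, the output approximates a typical outline for the corpus, and the share column separates common sections from outliers such as a title used by only one paper.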

Paper Structure

This paper contains 99 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Side-by-side comparison of sentence rendering modes that help the writer identify textual commonalities and variations across retrieved sentences: (left) highlighting each commonly recurring word in the same distinct color versus (right) greying out repetitions of words in subsequent sentences.
  • Figure 2: Illustration of the bookmarking feature showing how writers can save sentences and add notes, with the original bookmarked text preserved for context. In this example, the writer added a reminder to note the gender distribution of participants.
  • Figure 3: Sentence embedding workflow designed to retrieve analogous content across papers by incorporating section context. The process prepends section titles to sentences before vectorization, enabling the system to find similar content from matching sections (e.g., 'Participants' sections) across different papers, as shown in the three retrieved examples on the right.
  • Figure 4: Examples of different users' immediate writing contexts when requesting sentence examples, and the first of CorpusStudio's retrieval results. Note that these are representations of system logs, not user interface screenshots. (Left) P3 queried for inspiration on how to introduce a formative study section, retrieving examples of how other papers structured and presented their study introductions. (Right) P4 searched for ways to acknowledge design and study challenges, finding various examples of how other papers discussed limitations and technical constraints.
  • Figure 5: Usefulness of drafting an outline in the baseline condition compared with drafting one in CorpusStudio.
  • ...and 6 more figures
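
The workflow described for Figure 3 (prepending section titles to sentences before vectorization) and the recurring-word highlighting in Figure 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: a bag-of-words count vector with cosine similarity stands in for the sentence embedding model, which the captions do not name, and the functions `embed`, `retrieve`, and `recurring_words` as well as the toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(section_title, sentence):
    # Prepend the section title so the section context shapes the vector.
    # Stand-in vectorizer (assumption): raw token counts, not a neural embedding.
    return Counter(tokenize(f"{section_title}: {sentence}"))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query_section, query_sentence, corpus, k=3):
    """Return the k (score, section, sentence) entries most similar to the query."""
    query_vec = embed(query_section, query_sentence)
    scored = [(cosine(query_vec, embed(sec, sent)), sec, sent) for sec, sent in corpus]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def recurring_words(sentences, min_sentences=2):
    """Words that occur in at least min_sentences of the given sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(tokenize(sentence)))  # count once per sentence
    return {word for word, count in counts.items() if count >= min_sentences}


# Hypothetical toy corpus of (section title, sentence) pairs.
toy_corpus = [
    ("Participants", "We recruited 12 participants from a local university."),
    ("Participants", "Sixteen participants were recruited via mailing lists."),
    ("Results", "Participants rated the tool as useful."),
    ("Limitations", "Our study has a small sample size."),
]

top = retrieve("Participants", "We recruited 16 participants", toy_corpus, k=2)
for score, section, sentence in top:
    print(f"{score:.2f}  [{section}] {sentence}")
print(sorted(recurring_words([sentence for _, _, sentence in top])))
```

Because the section title is part of the text being vectorized, sentences from matching sections score higher than similar sentences from unrelated sections. A real system would likely also filter stop words before deciding which recurring words to highlight; this sketch omits that step.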