Simplifying Scholarly Abstracts for Accessible Digital Libraries

Haining Wang, Jason Clark

TL;DR

The paper addresses the problem of making scholarly abstracts accessible to non-experts. It introduces the SASS corpus of 3,430 abstract–significance statement pairs and fine-tunes compact language models that can be run locally to rewrite abstracts into plain language. Fine-tuned OLMo-1B, Gemma-2B, and Gemma-7B improve readability by about 3 points on the Automated Readability Index (ARI), roughly three grade levels, while preserving the meaning of the original; Phi-2 is less consistent. Compared with zero-shot GPT-3.5 and GPT-4o baselines, the fine-tuned models retain semantics well and alleviate the privacy concerns associated with commercial models, which makes deployment within digital libraries practical. The work points to practical benefits for inclusivity and suggests two future directions: RL-based decoding to enhance word-level accessibility, and integration of rewritten abstracts into search indices to improve information retrieval.

Abstract

Standing at the forefront of knowledge dissemination, digital libraries curate vast collections of scientific literature. However, these scholarly writings are often laden with jargon and tailored for domain experts rather than the general public. As librarians, we strive to offer services to a diverse audience, including those with lower reading levels. To extend our services beyond mere access, we propose fine-tuning a language model to rewrite scholarly abstracts into more comprehensible versions, thereby making scholarly literature more accessible when requested. We began by introducing a corpus specifically designed for training models to simplify scholarly abstracts. This corpus consists of over three thousand pairs of abstracts and significance statements from diverse disciplines. We then fine-tuned four language models using this corpus. The outputs from the models were subsequently examined both quantitatively for accessibility and semantic coherence, and qualitatively for language quality, faithfulness, and completeness. Our findings show that the resulting models can improve readability by over three grade levels, while maintaining fidelity to the original content. Although commercial state-of-the-art models still hold an edge, our models are much more compact, can be deployed locally in an affordable manner, and alleviate the privacy concerns associated with using commercial models. We envision this work as a step toward more inclusive and accessible libraries, improving our services for young readers and those without a college degree.

Paper Structure (16 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Discipline and readability distributions of abstracts and significance statements found in the training set of the Scientific Abstract-Significance Statement Corpus. The count of paired samples in different disciplines is shown in blue bars on a log10 scale (disciplines with fewer than three samples are not shown). Readability is measured using the Automated Readability Index (ARI), which estimates the number of years of schooling required to understand a text. On average, abstracts have a readability slightly below 20 ARI, indicating a post-graduate level. Significance statements are generally more readable than their corresponding abstracts.
  • Figure 2: We annotated 5% of generated outputs from OLMo-1B, Gemma-2B/-7B, and Phi-2 with respect to language quality, faithfulness, and completeness. The fine-tuned Gemma-7B performed the best on balance, followed by Gemma-2B and OLMo-1B.
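
The readability scores referred to in Figure 1 use the Automated Readability Index. As a rough illustration of how such a score can be computed, the sketch below applies the standard ARI formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The function name and the tokenization rules (whitespace-separated words, sentences ending in `.`, `!`, or `?`) are assumptions made for this example and may differ from the procedure used in the paper.

```python
import re


def automated_readability_index(text: str) -> float:
    """Estimate the ARI grade level of `text` (illustrative sketch only)."""
    # Assumption: a word is any whitespace-separated token with a letter or digit.
    words = [w for w in text.split() if re.search(r"[A-Za-z0-9]", w)]
    # Assumption: a sentence ends at ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    # Count letters and digits only, so punctuation does not inflate the score.
    characters = sum(len(re.findall(r"[A-Za-z0-9]", w)) for w in words)
    return (
        4.71 * characters / len(words)
        + 0.5 * len(words) / len(sentences)
        - 21.43
    )


if __name__ == "__main__":
    plain = "We made a tool. It helps people read."
    dense = "Contemporary computational methodologies facilitate comprehensive simplification."
    print(round(automated_readability_index(plain), 2))
    print(round(automated_readability_index(dense), 2))
```

Shorter words and shorter sentences both lower the score, which is why a plain-language rewrite of an abstract is expected to score below the original. Very simple text can produce scores below zero, so the value is best read as a relative indicator rather than a literal school grade.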