Scientific Statement Classification over arXiv.org

Deyan Ginev, Bruce R. Miller

TL;DR

This work defines a novel paragraph-level scientific statement classification task over arXiv, built from the machine-readable arXMLiv corpus to yield a large-scale labeled dataset. It introduces a 50-label scheme and a reduced 13-class taxonomy, along with a lexeme serialization of mathematical formulas that lets models process text and formulas jointly. Baseline analyses show strong performance, with a BiLSTM encoder-decoder reaching up to 0.91 F1 on the 13-class task, and reveal meaningful class separability and linguistic nests. The authors release open preprocessing, models, and the dataset to enable reproducible research, and discuss limitations and directions toward richer document-level discourse modeling.

Abstract

We introduce a new classification task for scientific statements and release a large-scale dataset for supervised learning. Our resource is derived from a machine-readable representation of the arXiv.org collection of preprint articles. We explore fifty author-annotated categories and empirically motivate a task design of grouping 10.5 million annotated paragraphs into thirteen classes. We demonstrate that the task setup aligns with known success rates from the state of the art, peaking at a 0.91 F1-score via a BiLSTM encoder-decoder model. Additionally, we introduce a lexeme serialization for mathematical formulas, and observe that context-aware models could improve when also trained on the symbolic modality. Finally, we discuss the limitations of both data and task design, and outline potential directions towards increasingly complex models of scientific discourse, beyond isolated statements.

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Plain-text equivalent with sub-formula lexemes, for a LaTeX-authored remark
  • Figure 2: Normalized confusion matrix of a 50-class BiLSTM encoder-decoder
  • Figure 3: Normalized confusion matrix of a 13-class BiLSTM encoder-decoder