SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki

TL;DR

SciDMT is presented as the largest corpus to date for scientific entity mention detection (SEMD). It pairs a main collection of 48,049 documents, carrying over 1.8 million weakly labeled mentions of datasets, methods, and tasks (DMT), with a 100-paper human-annotated evaluation set. The corpus is built by distant supervision from Papers with Code and S2ORC, with ontology linking through ITO and regex preprocessing to maximize coverage, and it provides document-level context for training SEMD models. The paper evaluates several baselines, including SciBERT and GPT-3.5, shows that large-scale weak labels can be complemented by human annotations to reach strong performance, and analyzes training scale and error patterns. The resource targets scientific information extraction, indexing, and retrieval; the paper also outlines limitations and directions for future work, such as handling unseen or ambiguous mentions and refining label quality.

Abstract

We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus's scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus's utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.
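
To make the idea of weakly annotated in-text spans concrete, the following is a minimal sketch of dictionary-based weak labeling. It is not the authors' pipeline: the three-entry dictionary `ENTITY_DICT`, the function name `weak_label`, and the case-sensitive whole-word matching rule are all assumptions chosen for illustration.

```python
import re

# Hypothetical mini-dictionary mapping entity names to a type
# (D = dataset, M = method, T = task). Illustrative only.
ENTITY_DICT = {
    "ImageNet": "D",
    "EfficientNet": "M",
    "image classification": "T",
}


def weak_label(text, entity_dict):
    """Return (start, end, type, surface) tuples for dictionary hits.

    Offsets are character positions with an exclusive end.
    Matching is case-sensitive and restricted to whole words.
    """
    # Try longer names first so they win over names they contain.
    names = sorted(entity_dict, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    return [
        (m.start(), m.end(), entity_dict[m.group(0)], m.group(0))
        for m in pattern.finditer(text)
    ]


text = "We train EfficientNet on ImageNet for image classification."
mentions = weak_label(text, ENTITY_DICT)
for start, end, etype, surface in mentions:
    print(start, end, etype, surface)
```

Labels produced this way are only as good as the dictionary: names missing from it are never found, and ambiguous names are tagged regardless of context, which is consistent with the unseen and ambiguous mention cases the summary lists as open problems.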

Paper Structure (25 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Example document-level annotation (top-left) and dictionary entries in SciDMT. Each occurrence of a dataset (D), method (M), or task (T) in a paper is marked with its in-text span, entity index, and BIO tag. For example, the method mention 'EfficientNet' spans characters 2469 to 2481 and has the BIO tag 'B-M'.
  • Figure 2: Trend of F1 when varying the number (N) of human-annotated samples used for fine-tuning. Each line in the graph (see legend) corresponds to a model trained on a different dataset.
  • Figure 3: Validation performance of SciBERT trained on SciDMT as the training set size increases.
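
The annotation format described in the Figure 1 caption (character spans plus BIO tags) can be illustrated with a short sketch. It assumes end-exclusive character offsets, which matches the caption's example (2481 - 2469 = 12, the length of 'EfficientNet'), and plain whitespace tokenization. The function name `spans_to_bio` and the sample sentence are hypothetical, and the corpus's actual tokenizer may differ.

```python
import re


def spans_to_bio(text, spans):
    """Convert character spans to token-level BIO tags.

    spans: list of (start, end, type) with exclusive end offsets,
    assumed not to overlap each other.
    Tokens are maximal runs of non-whitespace characters.
    """
    tokens, tags = [], []
    for m in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, etype in spans:
            # A token belongs to a span if the two ranges overlap.
            if m.start() < end and m.end() > start:
                # The first token of a span gets B-, later ones get I-.
                prefix = "B-" if m.start() <= start else "I-"
                tag = prefix + etype
                break
        tokens.append(m.group(0))
        tags.append(tag)
    return tokens, tags


text = "We fine-tune EfficientNet for image classification"
spans = [(13, 25, "M"), (30, 50, "T")]
tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

With whitespace tokenization, trailing punctuation stays attached to a token, so a subword or punctuation-aware tokenizer would need an additional alignment step.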