Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models

Tong Zeng, Daniel E. Acuna

TL;DR

The study tackles the problem of identifying sentences that require citations (citation worthiness) by proposing an attention-based BiLSTM architecture that leverages contextual sentence information. It introduces a large PMOA-CITE dataset from PubMed Central, enabling robust deep learning training and cross-domain evaluation with ACL-ARC, and demonstrates state-of-the-art performance on ACL-ARC ($F_{1}=0.507$) and strong results on PMOA-CITE ($F_{1}=0.856$) with context. Complementary interpretable models (Elastic-net Logistic Regression and Random Forest) illuminate the linguistic signals and topics that drive citation decisions, highlighting the pivotal roles of target sentences, surrounding sentences, and section types. The work also reveals practical implications for publishing workflows, including data-quality issues and potential pre-submission checks, and provides open-source code, datasets, and a web tool to the research community.

Abstract

Scientists learn early on how to cite scientific sources to support their claims. Sometimes, however, scientists have difficulty determining where a citation should be placed or, even worse, fail to cite a source altogether. Automatically detecting sentences that need a citation (i.e., citation worthiness) could address both of these issues, leading to more robust and well-constructed scientific arguments. Previous researchers have applied machine learning to this task but have used small datasets and models that do not take advantage of recent algorithmic developments such as attention mechanisms in deep learning. We hypothesize that we can develop highly accurate deep learning architectures that learn from large supervised datasets constructed from open access publications. In this work, we propose a Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism and contextual information to detect sentences that need citations. We also produce a new, large dataset (PMOA-CITE) based on the PubMed Open Access Subset, which is orders of magnitude larger than previous datasets. Our experiments show that our architecture achieves state-of-the-art performance on the standard ACL-ARC dataset ($F_{1}=0.507$) and exhibits high performance ($F_{1}=0.856$) on the new PMOA-CITE. Moreover, we show that it can transfer learning across these datasets. We further use interpretable models to illuminate how specific language is used to promote and inhibit citations. We discover that sections and surrounding sentences are crucial for our improved predictions. We further examine purported mispredictions of the model and uncover systematic human mistakes in citation behavior and source data. This opens the door for our model to check documents during pre-submission and pre-archival procedures. We make this new dataset, the code, and a web-based tool available to the community.
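The abstract reports results as $F_{1}$ scores. As a reminder of what that number measures, here is a minimal sketch of the standard binary $F_{1}$, computed on the positive class (label 1, meaning "this sentence needs a citation"). The function name `f1_score` and the label encoding are assumptions made for illustration; this excerpt does not state how the authors computed or averaged the metric.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for the positive class (assumed: 1 = needs a citation)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged, no citation
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed citation
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for five sentences, for illustration only.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
score = f1_score(gold, predicted)
```

Because $F_{1}$ is the harmonic mean of precision and recall, it stays low whenever either one is low, which makes it more informative than accuracy when only a minority of sentences carry citations.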

Paper Structure (32 sections, 23 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Citation worthiness prediction problem. For a given sentence ($S_{n}$), the goal of the task is to predict whether it needs a citation. The prediction task may use the section, the previous and next sentences (i.e., $S_{n-1}$ and $S_{n+1}$) for such prediction.
  • Figure 2: A sample of a PMC Open Access Subset (PMOAS) XML. The structure is defined by a standard Document Type Definition (DTD) which makes all articles consistent. In particular, the tag and attributes of a citation are well known.
  • Figure 3: The distribution of sentence length
  • Figure 4: The architecture of the proposed attention-based BiLSTM neural network.
  • Figure 5: The training and validation $F_{1}$ performance of Att-BiLSTM$_{cos}$ on the ACL-ARC dataset: the x-axis shows the number of epochs, and the y-axis shows the $F_{1}$ score.
  • ...and 3 more figures
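
The task setup in the Figure 1 caption (a target sentence $S_{n}$ with optional context from $S_{n-1}$, $S_{n+1}$, and the section) can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `CitationExample` and `build_examples` are hypothetical, and how the authors handle sentences at section boundaries is not described in this excerpt, so missing neighbors are simply set to `None` here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CitationExample:
    """One prediction instance: a target sentence plus its context."""
    section: str              # section the sentence belongs to
    previous: Optional[str]   # S_{n-1}, or None at the start of the section
    target: str               # S_n, the sentence being classified
    following: Optional[str]  # S_{n+1}, or None at the end of the section


def build_examples(section: str, sentences: List[str]) -> List[CitationExample]:
    """Pair every sentence of one section with its immediate neighbors."""
    examples = []
    for n, target in enumerate(sentences):
        previous = sentences[n - 1] if n > 0 else None
        following = sentences[n + 1] if n + 1 < len(sentences) else None
        examples.append(CitationExample(section, previous, target, following))
    return examples


# Hypothetical three-sentence section, for illustration only.
examples = build_examples(
    "introduction",
    ["First sentence.", "Second sentence.", "Third sentence."],
)
```

A classifier would then take each `CitationExample` as input and output whether `target` needs a citation.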