Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

Sai Koneru, Jian Wu, Sarah Rajtmajer

Abstract

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. We study a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices across four Large Language Model extractors, varying context quantity and context quality (standard Retrieval-Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), and adding an oracle paragraph setting to separate retrieval failures from extraction limits. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.

Paper Structure (20 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: An example claim trace spanning the Abstract, Introduction, and Results sections of a paper. Our two-stage pipeline uses an abstract-level finding to extract a hypothesis statement (from the Introduction, in this case) and then locate statistical evidence (Results).
  • Figure 2: Schematic of the sequential retrieve-and-extract pipeline. Stage 1 (top) uses the abstract finding to extract the Hypothesis, which then serves as part of a composite query for extracting the Statistical Evidence in Stage 2 (bottom).
  • Figure 3: Illustration of component-level evaluation of evidence extraction. The metric assigns partial credit when the extracted span captures the relationship but corresponds to a different reported effect with different numeric details.
  • Figure 4: Hypothesis extraction F1 versus relevant-sentence proportion (RSP), binned. Points show mean F1 within each bin and point size is proportional to the number of papers in the bin.
  • Figure 5: Calibration of the cosine similarity threshold using a precision-recall curve. The plot shows precision and recall at different thresholds for the Gemini embedding model. The selected threshold of 0.89 corresponds to the point of maximum F1 score (0.72) on a manually labeled set.
  • ...and 3 more figures
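
The threshold calibration described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes a list of cosine similarity scores paired with binary manual labels and sweeps candidate thresholds to find the one that maximizes F1. The function name `best_f1_threshold` and the toy data are hypothetical, chosen only for illustration; this is not the authors' code, and the paper's reported values (threshold 0.89, F1 0.72) come from their own labeled set.

```python
from typing import List, Tuple


def best_f1_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1 when a pair is predicted
    positive if its similarity score is >= threshold.

    Hypothetical helper: scores are cosine similarities, labels are
    manual 0/1 judgments of whether the pair is a true match.
    """
    best_t, best_f1 = 0.0, -1.0
    # Every distinct observed score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Strict comparison keeps the lowest threshold among ties.
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data (made up for illustration, not from the paper).
toy_scores = [0.95, 0.91, 0.90, 0.85, 0.80, 0.60]
toy_labels = [1, 1, 0, 1, 0, 0]
threshold, f1 = best_f1_threshold(toy_scores, toy_labels)
print(f"threshold={threshold:.2f}, F1={f1:.3f}")
```

Sweeping only the observed scores is sufficient because precision and recall change only at those values. On a larger labeled set, a library routine that computes the full precision-recall curve would be the more practical choice.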