ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

Pengze Li, Jiaqi Liu, Junchi Yu, Lihao Liu, Mingyu Ding, Wanli Ouyang, Shixiang Tang, Xi Chen

TL;DR

ARCHE tackles the challenge of evaluating whether LLMs can extract latent, paradigm-grounded reasoning from scientific text. It defines Latent Reasoning Chain Extraction (ARCHE) and the structured Reasoning-Logic Tree (RLT) to represent deduction, induction, and abduction steps, supported by a two-stage generation and validation pipeline. ARCHE Bench, created from 70 Nature Communications articles, provides EC and REA metrics to measure content coverage and step-wise logical validity, respectively. Zero-shot assessments across 10 models reveal a persistent gap: models achieve moderate entity coverage but fall short on coherent, correctly-typed reasoning chains, underscoring the need for paradigm-guided supervision to improve scientific trustworthiness and reproducibility.

Abstract

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce's fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and 38,000 viewpoints. We propose two logic-aware evaluation metrics: Entity Coverage (EC) for content completeness and Reasoning Edge Accuracy (REA) for step-by-step logical validity. Evaluations on 10 leading LLMs on ARCHE Bench reveal that models exhibit a trade-off between REA and EC, and none are yet able to extract a complete and standard reasoning chain. These findings highlight a substantial gap between the abilities of current reasoning models and the rigor required for scientific argumentation.
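To make the division of labor between the two metrics concrete, the sketch below computes toy versions of both. The formulas are assumed simplifications for illustration, not the paper's definitions: EC is taken as the share of reference entities that also appear in the prediction, and REA as the share of reasoning edges judged valid. The function names and the case-insensitive exact matching of entities are hypothetical choices.

```python
from typing import Iterable, Set


def entity_coverage(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Assumed simplification of EC: fraction of reference entities
    that also occur in the predicted chain (case-insensitive exact match)."""
    ref: Set[str] = {e.strip().lower() for e in reference}
    pred: Set[str] = {e.strip().lower() for e in predicted}
    if not ref:
        return 0.0
    return len(ref & pred) / len(ref)


def reasoning_edge_accuracy(edge_judgments: Iterable[bool]) -> float:
    """Assumed simplification of REA: fraction of reasoning edges whose
    inference step was judged valid (True) by some external checker."""
    judgments = list(edge_judgments)
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Hypothetical example: two of four reference entities are covered,
# and three of four extracted edges are judged valid.
ec = entity_coverage(
    ["Graphene", "band gap"],
    ["graphene", "band gap", "doping", "strain"],
)
rea = reasoning_edge_accuracy([True, False, True, True])
```

Under these assumptions a model can score well on one metric and poorly on the other, which is the kind of trade-off the abstract reports: a short chain with few edges may keep REA high while covering few entities, and a long chain may do the reverse.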

Paper Structure

This paper contains 43 sections, 2 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of ARCHE Bench. The pipeline includes three stages: Data Processing, where scientific articles are preprocessed to extract introductions, cited abstracts, and viewpoints; RLT Generation, which constructs and repairs reasoning logic trees (RLTs) through automatic validation and self-correction; and Evaluation, where models are assessed using metrics like EC and REA for performance analysis.
  • Figure 2: The Construction of Reasoning-Logic Tree. Viewpoints are first extracted from scientific text, then organized into a hierarchical reasoning structure. Each reasoning edge is annotated with an inference type: deduction, induction, or abduction, based on its logical pattern.
  • Figure 3: Comparative performance of LLMs in terms of EC and REA across three domains: (Left) Physical Sciences, (Middle) Biological Sciences, and (Right) Overall. Each point represents a model’s performance. The green region (BR) indicates a preferable area with both higher coverage and accuracy. The red dashed curve denotes the trade-off frontier.
  • Figure 4: LLM Accuracy Calibration.
  • Figure 5: Prompt for ARCHE (Part 1)
  • ...and 5 more figures
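
The Reasoning-Logic Tree described in the Figure 2 caption can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `RLTNode`, `InferenceType`, `is_well_formed`, and `edge_type_counts` are hypothetical, and the well-formedness rule (every derived viewpoint carries exactly one inference type, every leaf viewpoint carries none) is an assumed stand-in for the automatic validation mentioned in the Figure 1 caption.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InferenceType(Enum):
    DEDUCTION = "deduction"
    INDUCTION = "induction"
    ABDUCTION = "abduction"


@dataclass
class RLTNode:
    """A viewpoint; if it has premises, `inference` labels the edges to them."""
    viewpoint: str
    inference: Optional[InferenceType] = None
    premises: List["RLTNode"] = field(default_factory=list)


def is_well_formed(node: RLTNode) -> bool:
    """Assumed rule: derived nodes need an inference type, leaves must have none."""
    if node.premises:
        if node.inference is None:
            return False
        return all(is_well_formed(p) for p in node.premises)
    return node.inference is None


def edge_type_counts(node: RLTNode) -> Counter:
    """Count reasoning edges per inference type over the whole tree."""
    counts: Counter = Counter()
    if node.premises:
        counts[node.inference] += len(node.premises)
        for premise in node.premises:
            counts += edge_type_counts(premise)
    return counts


# Hypothetical example: a deductive conclusion resting on one stated rule
# and one generalization induced from two observations.
root = RLTNode(
    "Material X is a candidate for application Y",
    InferenceType.DEDUCTION,
    premises=[
        RLTNode("Materials with property P suit application Y"),
        RLTNode(
            "Material X has property P",
            InferenceType.INDUCTION,
            premises=[
                RLTNode("Sample 1 of X shows P"),
                RLTNode("Sample 2 of X shows P"),
            ],
        ),
    ],
)
```

Storing the inference type on the derived node rather than on each edge is a design choice of this sketch: it assumes all premises of one conclusion share a single inference mode, which keeps the tree compact but would need revisiting if edges into the same conclusion could differ in type.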