LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

You-Le Fang, Dong-Shan Jian, Xiang Li, Ce Meng, Ling-Shi Meng, Chen-Xu Yan, Zhi-Zhang Bian, Yan-Qing Ma

TL;DR

LOCA introduces a principled augment-and-review framework that enforces logical completeness by transforming raw scientific QA answers into structured chains with explicit principles and derivations. Through iterative augmentation and dual-layered review, LOCA effectively filters noisy corpora and produces high-quality, verifiable reasoning, significantly lowering residual error rates on physics benchmarks. The approach yields a scalable path to reliable scientific AI training and evaluation, with demonstrated improvements across multiple physics datasets and clear ablation evidence that both chain augmentation and specialized review are essential. The methodology is broadly applicable to principle-based domains beyond physics and has educational and benchmarking benefits.

Abstract

While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.

Paper Structure

This paper contains 27 sections, 55 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Pipeline of LOCA. LOCA employs an iterative augment-and-review loop. In each iteration, given a raw answer with some reasoning process ($A_{\text{aug}}$), an augmentation agent structures it through chain completion and structured decomposition; this structured output is then assessed by specialized review agents. Based on the feedback, the answer is either refined for the next iteration, accepted after passing multiple checks, or rejected. Accepted answers undergo a final external consistency check against $A_{\text{raw}}$, while rejected ones can be flexibly routed to human experts for review.
  • Figure 2: Impact on evaluating LLMs' performance. We compare model performance on three versions of PHYBench: the original Raw set (100 questions); the high-accuracy Filtered subset accepted by LOCA (59 questions); and the Corrected set (100 questions), where flawed QA pairs are manually fixed with the aid of LOCA’s augmentation.
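
The augment-and-review loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `loca_clean`, `Review`, and `Outcome`, the parameters `max_iters` and `passes_required`, and the rule that an exhausted iteration budget means rejection are all assumptions made for illustration. The augmentation agent, the review agents, and the external consistency check are passed in as plain callables, and the toy stand-ins at the bottom only check for section labels.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    passed: bool
    feedback: str = ""


@dataclass
class Outcome:
    status: str   # "accepted" or "rejected"
    answer: str   # latest augmented answer (kept on rejection for human review)


def loca_clean(
    question: str,
    a_raw: str,
    augment: Callable[[str, str, List[str]], str],
    reviewers: List[Callable[[str, str], Review]],
    is_consistent: Callable[[str, str], bool],
    max_iters: int = 4,        # hypothetical iteration budget
    passes_required: int = 2,  # hypothetical number of clean review rounds
) -> Outcome:
    """Sketch of an augment-and-review loop under the assumptions stated above."""
    a_aug = augment(question, a_raw, [])
    clean_rounds = 0
    for _ in range(max_iters):
        reviews = [review(question, a_aug) for review in reviewers]
        feedback = [r.feedback for r in reviews if not r.passed]
        if feedback:
            # At least one reviewer objected: refine and try again.
            clean_rounds = 0
            a_aug = augment(question, a_aug, feedback)
            continue
        clean_rounds += 1
        if clean_rounds >= passes_required:
            # Final external check of the augmented answer against the raw one.
            ok = is_consistent(a_raw, a_aug)
            return Outcome("accepted" if ok else "rejected", a_aug)
    # Budget exhausted without enough clean rounds (assumed to mean rejection).
    return Outcome("rejected", a_aug)


# Toy stand-ins for the agents, for demonstration only.
def toy_augment(question: str, answer: str, feedback: List[str]) -> str:
    for missing_section in feedback:
        answer += f"\n{missing_section}: (filled in)"
    return answer


def needs(section: str) -> Callable[[str, str], Review]:
    def review(question: str, answer: str) -> Review:
        ok = section in answer
        return Review(ok, "" if ok else section)
    return review


def toy_consistent(a_raw: str, a_aug: str) -> bool:
    return a_raw in a_aug


result = loca_clean(
    "What is the final speed?",
    "v = 3 m/s",
    toy_augment,
    [needs("Principle"), needs("Derivation")],
    toy_consistent,
)
print(result.status)
```

In this toy run the first review round fails because both section labels are missing, the augmenter adds them, and the next two rounds pass, so the answer reaches the consistency check. In a real setting each callable would wrap an LLM agent, and rejected outcomes could be routed to human experts as the caption suggests.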