ORACLE: Optimizing Reasoning Abilities of Large Language Models via Constraint-Led Synthetic Data Elicitation

Zhuojie Yang; Wentao Wan; Keze Wang

Abstract

Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities. A key factor in the effectiveness of this paradigm is the quality of the generated multi-step reasoning data. Many recent methods generate synthetic reasoning paths and filter them by final-answer correctness, often overlooking flaws in intermediate reasoning steps. To verify intermediate steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code-based validation is restricted to code or mathematical tasks, and reasoning engines require a well-structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts, and synthetic data for these tasks still lacks reliable step-level checks. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step-wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine-grained, step-level validation, facilitating the construction of high-quality multi-step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, ORACLE consistently outperforms strong baselines on multiple models.

Paper Structure (24 sections, 3 figures, 4 tables)

Introduction
Related Work
Synthetic Reasoning Data Generation.
Verification and Programmatic Supervision.
LLM Supervision and Preference Training.
Methodology
Template Design
Model Training
Beam Search with LLM Evaluation and Reasoning Engine Validation
Experiments
Experiment Setup
Datasets.
Baselines.
Implementation details.
Experimental Results and Analyses
...and 9 more sections

Figures (3)

Figure 1: Overview of ORACLE's two-stage training pipeline. Stage 1 employs few-shot prompting and template-based reasoning generation, followed by answer-based and format-based filtering; Stage 2 integrates symbolic reasoning via beam search to produce high-quality reasoning data for supervised fine-tuning (SFT) and Direct Preference Optimization (DPO).
Figure 2: Overview of the structured reasoning template used during ORACLE's data generation. Each reasoning step consists of modular fields: <QUERY>, <FACTS>, <RULE>, <REVISION>, <REVISION_RESULT>, and <REASONING_RESULT>. This design promotes interpretable reasoning, enables symbolic verification, and facilitates automatic extraction via pattern matching.
Figure 3: Overview of ORACLE's beam search, which integrates a symbolic reasoning engine with LLM-based evaluation. NL denotes natural language and SL denotes symbolic language. At each step, the LLM generates a candidate reasoning step in natural language, which is then translated into symbolic form and passed to the reasoning engine for execution. Each candidate is scored on execution success ($W_1$), the LLM's precision assessment ($W_2$), and feasibility estimation ($W_3$). Candidates that execute successfully receive a final score of $W_1 + W_3$, while the others receive $W_2 + W_3$. The top-$K$ candidates by score are selected and expanded in the next layer. Complete reasoning paths that produce correct final answers are collected as supervised fine-tuning (SFT) data. Additionally, preference pairs for Direct Preference Optimization (DPO) are constructed by comparing symbolically validated nodes with their invalid siblings.
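
The scoring and selection rule described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class, its field names, and the helper functions are hypothetical, the three scores are assumed to be plain floats already computed elsewhere, and successful execution is used here as a stand-in for "symbolically validated".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One candidate reasoning step (hypothetical container, not from the paper)."""
    text: str        # natural-language reasoning step produced by the LLM
    executed: bool   # True if the symbolic engine ran the translated step
    w1: float        # execution-success score (W_1)
    w2: float        # LLM precision assessment (W_2)
    w3: float        # feasibility estimate (W_3)


def final_score(c: Candidate) -> float:
    # Executed candidates score W_1 + W_3; all others score W_2 + W_3.
    return (c.w1 if c.executed else c.w2) + c.w3


def select_top_k(candidates: List[Candidate], k: int) -> List[Candidate]:
    # Keep the K highest-scoring candidates for expansion in the next layer.
    return sorted(candidates, key=final_score, reverse=True)[:k]


def preference_pairs(siblings: List[Candidate]) -> List[Tuple[str, str]]:
    # Pair each validated step (preferred) with each invalid sibling (rejected).
    # Assumption: "validated" is approximated by the `executed` flag.
    valid = [c for c in siblings if c.executed]
    invalid = [c for c in siblings if not c.executed]
    return [(v.text, i.text) for v in valid for i in invalid]


siblings = [
    Candidate("step A", executed=True, w1=1.0, w2=0.25, w3=0.5),
    Candidate("step B", executed=False, w1=1.0, w2=0.25, w3=0.25),
    Candidate("step C", executed=False, w1=1.0, w2=1.0, w3=0.75),
]
beam = select_top_k(siblings, k=2)
pairs = preference_pairs(siblings)
```

Note that under this rule an unexecuted candidate can still outrank an executed one when its $W_2 + W_3$ is larger, as "step C" does in the toy data; how the weights are calibrated to control this is not specified in the caption.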
