Evaluating Step-by-Step Reasoning through Symbolic Verification

Yi-Fan Zhang; Hanlin Zhang; Li Erran Li; Eric Xing

TL;DR

Comprehensive experiments are included to systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP enjoys more than $25\%$ higher accuracy than CoT on length generalization benchmarks even with smaller model sizes.

Abstract

Pre-trained language models (LMs) have shown remarkable reasoning performance when prompted with explanations or chain-of-thought (CoT) demonstrations for in-context learning. On the other hand, these reasoning tasks are usually presumed to be more approachable for symbolic programming. To understand how LMs reason, we curate synthetic datasets containing equivalent (natural, symbolic) data pairs, where symbolic examples contain first-order logic rules and predicates from non-parametric knowledge bases (KBs), supporting automated verification of intermediate reasoning results. We then revisit neuro-symbolic approaches and propose Language Models as Logic Programmers (LMLP), which learns from demonstrations containing logic rules and corresponding examples to iteratively reason over KBs, recovering Prolog's backward chaining algorithm and supporting automated verification of LMs' outputs. Comprehensive experiments systematically compare LMLP with CoT in deductive reasoning settings, showing that LMLP achieves more than $25\%$ higher accuracy than CoT on length generalization benchmarks, even with smaller model sizes.
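To make the reference to backward chaining concrete, the following is a minimal sketch of goal-directed proof search over a toy kinship KB. It is an illustration under stated assumptions, not the paper's implementation: the facts, the two composition rules, the `prove` function, and the depth limit are all hypothetical, and no language model is involved.

```python
from typing import List, Optional, Tuple

Fact = Tuple[str, str, str]  # (relation, x, y), read as "y is the <relation> of x"

# Hypothetical toy KB, loosely in the style of kinship reasoning.
FACTS = {
    ("mother", "dave", "alice"),    # alice is dave's mother
    ("mother", "alice", "carol"),   # carol is alice's mother
    ("daughter", "carol", "beth"),  # beth is carol's daughter
}

# Hypothetical rules of the form head(X, Z) :- r1(X, Y), r2(Y, Z).
RULES = {
    "sister": [("mother", "daughter")],
    "aunt": [("mother", "sister")],
}

ENTITIES = {e for (_, x, y) in FACTS for e in (x, y)}


def prove(relation: str, x: str, z: str, depth: int = 3) -> Optional[List[Fact]]:
    """Backward chaining: return the KB facts that prove relation(x, z), or None."""
    # Base case: the goal is already a known fact.
    if (relation, x, z) in FACTS:
        return [(relation, x, z)]
    if depth == 0:
        return None
    # Otherwise, reduce the goal to the two sub-goals in a rule body.
    for r1, r2 in RULES.get(relation, []):
        for y in sorted(ENTITIES):
            if y in (x, z):
                continue  # skip degenerate intermediate entities
            left = prove(r1, x, y, depth - 1)
            if left is None:
                continue
            right = prove(r2, y, z, depth - 1)
            if right is not None:
                return left + right
    return None


proof = prove("aunt", "dave", "beth")
print(proof)
```

Because every returned step is a KB fact, the proof can be checked mechanically, which is the property the abstract relies on for automated verification.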

Paper Structure (12 sections, 4 figures, 12 tables)

This paper contains 12 sections, 4 figures, 12 tables.

Introduction
Related Works
Methodology Overview
Experiments
Comparisons of LMLP and CoT
Analysis of LMLP
Analysis of Demonstrations of ICL
Concluding Remarks
Extended Related Work
Algorithm Description
Data Generation
Additional Experimental Setups and Results

Figures (4)

Figure 1: Deductive reasoning performance (human evaluation accuracy) on CLUTRR-LP, given training data with story lengths 2, 3, and 4.
Figure 2: Illustration of a deductive reasoning example and the iterative prompting of LMLP. LMLP retrieves a first-order logic rule and an associated grounded example to answer the question. It stops when a predefined maximum number of iterations is reached or when the target entity of interest is reached. The reasoning path explains the sister concept.
Figure 3: Schematic overview of (a) LMLP and (b) CoT.
Figure 4: (a) Effect of the number of templates for LMLP on CLUTRR-LP. (b) Effect of noisy facts for LMLP on CLUTRR-LP. (c) Ablation on the scaling of planning LMs.
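The stopping rule in the Figure 2 caption, together with the idea of checking each intermediate step against a KB, can be sketched as below. This is a hedged illustration only: `verify_chain`, the `(head, relation, tail)` triple layout, the toy `KB`, and the default of five steps are assumptions, and the steps are supplied by hand rather than generated by an LM.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Hypothetical toy KB.
KB: Set[Triple] = {
    ("dave", "mother", "alice"),
    ("alice", "mother", "carol"),
    ("carol", "daughter", "beth"),
}


def verify_chain(
    kb: Set[Triple],
    steps: List[Triple],
    source: str,
    target: str,
    max_steps: int = 5,
) -> Tuple[bool, str]:
    """Check a proposed reasoning path step by step against the KB."""
    current = source
    # Only the first max_steps steps are considered (iteration budget).
    for i, (head, rel, tail) in enumerate(steps[:max_steps], start=1):
        if head != current:
            return False, f"step {i} starts at {head}, expected {current}"
        if (head, rel, tail) not in kb:
            return False, f"step {i} is not in the knowledge base"
        current = tail
        if current == target:
            # Stop as soon as the target entity is reached.
            return True, f"reached {target} after {i} steps"
    return False, "stopped without reaching the target"


good = [("dave", "mother", "alice"), ("alice", "mother", "carol"), ("carol", "daughter", "beth")]
bad = [("dave", "mother", "alice"), ("alice", "sister", "beth")]
print(verify_chain(KB, good, "dave", "beth"))
print(verify_chain(KB, bad, "dave", "beth"))
```

A path is accepted only if every step is grounded in the KB and the walk ends at the target within the iteration budget; an unsupported step is rejected at the point where it occurs.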
