Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification

Soumya Sanyal, Tianyi Xiao, Jiacheng Liu, Wenya Wang, Xiang Ren

TL;DR

This work investigates whether machines can outperform humans in complex reasoning by focusing on entailment verification (EV) with multi-sentence premises. It builds a diverse EV benchmark spanning natural language inference, contextual QA, and rationale datasets to study human versus LLM performance, finding that LLMs excel at multi-hop reasoning while humans excel at simple deductive tasks. The authors finetune Flan-T5-xxl with both classification and ranking objectives to produce a strong open-source EV model that rivals GPT-4 and beats GPT-3.5, and demonstrate its utility in filtering inconsistent chain-of-thought rationales during self-consistency decoding, yielding about a 6% average accuracy gain across MCQ tasks. The results highlight practical implications for improving long-context reasoning and rationale fidelity in NLP systems, while also outlining limitations related to data bias, generalization, and potential risk factors.

Abstract

Making inferences to understand meaning during text comprehension is essential in language processing. This work studies the entailment verification (EV) problem for multi-sentence premises, which requires a system to make multiple inferences implicitly. Studying EV for such complex premises is important because modern NLP problems, such as detecting inconsistent model-generated rationales, require complex multi-hop reasoning. However, current textual inference datasets mostly contain short premises that only partially address these challenges. To close this gap, we compile an EV benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales) containing multi-sentence premises. Benchmarking humans and LLMs, we find that LLMs are better than humans at multi-hop reasoning across extended contexts, while humans perform better on simple deductive reasoning tasks. We also finetune a Flan-T5 model for EV using two training objectives to obtain a strong open-source model that outperforms GPT-3.5 and rivals GPT-4. Finally, we use this model to filter out inconsistent model-generated rationales in self-consistency decoding, resulting in a 6% accuracy improvement on average across three MCQ datasets.
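
The rationale-filtering step mentioned at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filtered_majority_vote`, the `threshold` value, the fallback to an unfiltered vote, and the choice of the rationale as premise and the answer as hypothesis are all assumptions made for illustration. The scorer `toy_ev_score` is a hypothetical string-matching stand-in for a real EV model such as the finetuned Flan-T5.

```python
from collections import Counter
from typing import Callable, List, Tuple


def filtered_majority_vote(
    samples: List[Tuple[str, str]],
    ev_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> str:
    """Self-consistency vote restricted to rationales the EV scorer accepts.

    samples:   (rationale, answer) pairs sampled from a language model.
    ev_score:  hypothetical scorer returning how strongly the premise
               (here: the rationale) entails the hypothesis (here: the answer).
    threshold: assumed cutoff; the source does not specify one.
    """
    if not samples:
        raise ValueError("need at least one sampled rationale")

    # Keep only answers whose rationale is judged to entail them.
    kept = [ans for rationale, ans in samples if ev_score(rationale, ans) >= threshold]

    # Assumed fallback: if every rationale is rejected, vote over all samples.
    if not kept:
        kept = [ans for _, ans in samples]

    return Counter(kept).most_common(1)[0][0]


def toy_ev_score(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an EV model: 1.0 if the rationale states the answer."""
    return 1.0 if f"answer is {hypothesis}".lower() in premise.lower() else 0.0


# Illustrative samples: three rationales argue for one option but report another.
samples = [
    ("Heat flows into the ice, so the answer is B.", "B"),
    ("Metals conduct heat, so the answer is A.", "C"),
    ("Ice is cold, so the answer is A.", "C"),
    ("Nothing changes, so the answer is D.", "C"),
    ("Melting needs heat, so the answer is B.", "B"),
]

unfiltered = Counter(ans for _, ans in samples).most_common(1)[0][0]
filtered = filtered_majority_vote(samples, toy_ev_score)
print(unfiltered, filtered)
```

In this toy setting the plain majority vote is swayed by the three inconsistent samples, while the filtered vote counts only the rationales that support their own answers. Swapping `toy_ev_score` for a real entailment model leaves the voting logic unchanged.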
