EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference

Abhilasha Ravichander, Aakanksha Naik, Carolyn Rose, Eduard Hovy

TL;DR

EQUATE introduces a comprehensive benchmark for quantitative reasoning in natural language inference, combining five diverse test sets from real-world and synthetic sources. The study benchmarks a wide range of neural NLI models and introduces Q-REAS, a symbolic baseline for numerical reasoning, revealing that neural models largely rely on lexical cues rather than quantitative reasoning. The results show Q-REAS excels at numerical aspects while neural models capture verbal nuances, underscoring the need for hybrid neuro-symbolic approaches. Overall, EQUATE provides a rigorous framework to diagnose and drive progress in quantitative language understanding.

Abstract

Quantitative reasoning is a higher-order reasoning skill that any intelligent natural language understanding system can reasonably be expected to handle. We present EQUATE (Evaluating Quantitative Understanding Aptitude in Textual Entailment), a new framework for quantitative reasoning in textual entailment. We benchmark the performance of 9 published NLI models on EQUATE, and find that on average, state-of-the-art methods do not achieve an absolute improvement over a majority-class baseline, suggesting that they do not implicitly learn to reason with quantities. We establish a new baseline Q-REAS that manipulates quantities symbolically. In comparison to the best performing NLI model, it achieves success on numerical reasoning tests (+24.2%), but has limited verbal reasoning capabilities (-8.1%). We hope our evaluation framework will support the development of models of quantitative reasoning in language understanding.

Paper Structure

This paper contains 23 sections, 1 figure, 8 tables, and 1 algorithm.

Figures (1)

  • Figure 1: Overview of Q-REAS baseline.