Does It Make Sense? And Why? A Pilot Study for Sense Making and Explanation

Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiaonan Li, Tian Gao

TL;DR

This work introduces a benchmark that tests sense making directly: a system must separate sensible from nonsensical statements and identify why a nonsensical statement fails. The test set comprises two subtasks, Sen-Making and Explanation, evaluated on language models (ELMo, BERT) and on humans. Results show that while the models outperform random baselines, they lag far behind humans, especially on the inference-heavy Explanation subtask, highlighting gaps in multi-step reasoning. The authors also provide a corpus analysis and a case study demonstrating the limits of current LM-based sense making, and emphasize the benchmark's potential for improving interpretability and guiding future research.

Abstract

Introducing common sense to natural language understanding systems has received increasing research attention. How to evaluate whether a system has a sense-making capability remains a fundamental question. Existing benchmarks measure commonsense knowledge indirectly and without explanation. In this paper, we release a benchmark to directly test whether a system can differentiate natural language statements that make sense from those that do not make sense. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that there are different challenges for system sense making.

Paper Structure

This paper contains 9 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: Samples of our dataset
  • Figure 2: Number of 'Different Words'