Does It Make Sense? And Why? A Pilot Study for Sense Making and Explanation
Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiaonan Li, Tian Gao
TL;DR
This work introduces a benchmark that tests sense making directly in natural language understanding: a system must separate sensible from nonsensical statements and justify why a statement fails. The test set comprises two subtasks, Sen-Making and Explanation, on which language models (ELMo, BERT) and humans are evaluated. Results show that although the models outperform random baselines, they lag far behind humans, especially on the inference-heavy Explanation task, which highlights gaps in multi-step reasoning. The authors also provide a corpus analysis and a case study that demonstrate the limits of current LM-based sense making, and they emphasize the benchmark's potential for improving interpretability and guiding future research.
Abstract
Introducing common sense to natural language understanding systems has received increasing research attention. How to evaluate whether a system has a sense-making capability remains a fundamental question. Existing benchmarks measure commonsense knowledge indirectly and without explanation. In this paper, we release a benchmark that directly tests whether a system can differentiate natural language statements that make sense from those that do not. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that sense making poses different challenges for systems.
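The two-part evaluation described in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Item` layout, the `evaluate` helper, the number of candidate reasons, and the toy items are hypothetical and are not drawn from the released benchmark or its official scoring code. The sketch only shows how accuracy on each subtask could be computed for any system exposed as two selection functions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one benchmark question (layout is assumed)."""
    statements: Tuple[str, str]   # one sensible and one nonsensical statement
    nonsense_idx: int             # index of the statement that does not make sense
    reasons: Tuple[str, ...]      # candidate reasons why that statement fails
    reason_idx: int               # index of the most crucial reason


def evaluate(
    items: Sequence[Item],
    pick_nonsense: Callable[[Tuple[str, str]], int],
    pick_reason: Callable[[str, Tuple[str, ...]], int],
) -> Tuple[float, float]:
    """Return (Sen-Making accuracy, Explanation accuracy) for a system."""
    if not items:
        raise ValueError("items must not be empty")
    # Subtask 1: choose which of the two statements does not make sense.
    sen_hits = sum(pick_nonsense(it.statements) == it.nonsense_idx for it in items)
    # Subtask 2: given the gold nonsensical statement, choose the reason.
    exp_hits = sum(
        pick_reason(it.statements[it.nonsense_idx], it.reasons) == it.reason_idx
        for it in items
    )
    return sen_hits / len(items), exp_hits / len(items)


# Toy items written only for this illustration.
TOY_ITEMS = [
    Item(
        statements=("He put a turkey into the fridge.",
                    "He put an elephant into the fridge."),
        nonsense_idx=1,
        reasons=("An elephant is much bigger than a fridge.",
                 "Elephants are usually gray.",
                 "An elephant cannot eat a fridge."),
        reason_idx=0,
    ),
    Item(
        statements=("She drank a cup of sand.",
                    "She drank a cup of tea."),
        nonsense_idx=0,
        reasons=("Sand is found on beaches.",
                 "Sand is not a liquid that people drink.",
                 "Cups are often made of ceramic."),
        reason_idx=1,
    ),
]


def always_first_statement(statements: Tuple[str, str]) -> int:
    """Trivial baseline: always flag the first statement."""
    return 0


def always_first_reason(statement: str, reasons: Tuple[str, ...]) -> int:
    """Trivial baseline: always choose the first reason."""
    return 0


sen_acc, exp_acc = evaluate(TOY_ITEMS, always_first_statement, always_first_reason)
print(f"Sen-Making accuracy: {sen_acc:.2f}, Explanation accuracy: {exp_acc:.2f}")
```

A language-model system would plug in by replacing the two baseline functions, for example with one that scores each statement and flags the less probable one.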
