Filling the Gap: Is Commonsense Knowledge Generation useful for Natural Language Inference?

Chathuri Jayaweera, Brianna Yanqui, Bonnie Dorr

TL;DR

This work investigates whether Large Language Models can generate commonsense axioms to aid Natural Language Inference (NLI). It introduces a pipeline that generates axioms (P1), injects them before inference (P2), and compares against a direct inference baseline (P3), plus a hybrid selective-access strategy guided by a helpfulness rating. Through experiments on SNLI and ANLI using Llama-3.1-70B-Instruct and gpt-oss-120b, the hybrid approach achieves consistent improvements by effectively combining pre-prediction axiom injection with post-prediction reasoning. The findings highlight the value of targeted external commonsense knowledge for NLI and point to future work needed to reliably identify cases that benefit from such knowledge and to broaden evaluation across more models and languages.

Abstract

Natural Language Inference (NLI) is the task of determining whether a premise entails, contradicts, or is neutral with respect to a given hypothesis. The task is often framed as emulating human inferential processes, in which commonsense knowledge plays a major role. This study examines whether Large Language Models (LLMs) can generate useful commonsense axioms for NLI, and evaluates their impact on performance using the SNLI and ANLI benchmarks with the Llama-3.1-70B-Instruct and gpt-oss-120b models. We show that a hybrid approach, which selectively provides highly factual axioms based on judged helpfulness, yields consistent accuracy improvements of 1.99% to 6.88% across tested configurations, demonstrating the effectiveness of selective knowledge access for NLI. We also find that this targeted use of commonsense knowledge helps models overcome a bias toward the Neutral class by providing essential real-world context.
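The selective-access idea can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-to-5 helpfulness scale, the threshold of 4, and the names `Instance`, `hybrid_predict`, and `accuracy` are all assumptions made for illustration, since this summary does not specify the actual selection criterion. The sketch only shows the routing logic: use the axiom-injected prediction (P2) when the axiom is judged helpful enough, and otherwise fall back to the direct prediction (P3).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Instance:
    """One premise-hypothesis pair with precomputed model outputs (hypothetical schema)."""
    axiom: str          # commonsense axiom generated in the P1 step
    helpfulness: int    # assumed 1-5 rating of how helpful the axiom is
    p2_label: str       # prediction made with the axiom injected (P2)
    p3_label: str       # direct prediction without an axiom (P3)


def hybrid_predict(inst: Instance, threshold: int = 4) -> str:
    """Return the P2 label when the axiom is rated helpful, else the P3 label.

    The threshold value is an assumption, not a setting taken from the paper.
    """
    if inst.helpfulness >= threshold:
        return inst.p2_label
    return inst.p3_label


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Under these assumptions, an instance whose axiom is rated 5 would take its P2 label, while one rated 2 would keep its P3 label. Comparing `accuracy` over the hybrid labels with `accuracy` over the P3 labels alone gives the kind of improvement figure the abstract reports.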

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: SNLI instance illustrating the need for commonsense knowledge in inference prediction.
  • Figure 2: Experiment Methodology
  • Figure 3: $P1$ prompt format: We provide three in-context examples along with the instructions for the $P1$ prompt, followed by the test premise-hypothesis pair.
  • Figure 4: $P2$ prompt format: We provide one in-context example along with the instructions for the $P2$ prompt, followed by the test premise-hypothesis pair and the commonsense axiom from $P1$.
  • Figure 5: $P3$ prompt format: We provide the instructions and the test premise-hypothesis pair, without a commonsense axiom or any in-context examples, to obtain the $P3$ inference class prediction and the post-prediction commonsense axiom. This design is intentional: it contrasts direct prompting with the in-context learning used in the other prompts.
  • ...and 2 more figures