Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding

Yeonkyoung So, Gyuseong Lee, Sungmok Jung, Joonhak Lee, JiA Kang, Sangho Kim, Jaejin Lee

TL;DR

Thunder-NUBench introduces a sentence-level negation benchmark that formalizes a truth-functional standard negation operator $Neg(\cdot)$ and contrasts it with local negation, contradiction, and paraphrase to probe semantic reasoning in LLMs. The dataset comprises manually curated standard negations and a four-option multiple-choice (MCQ) evaluation built from English sources (HoVer and Wikipedia), with multi-stage review and explicit data-generation guidelines. Empirical results across model families (2–3B, 7–8B, and API models) show that few-shot prompting and supervised fine-tuning with LoRA improve performance, yet models frequently confuse local negation with standard negation, especially in complex sentence structures. Thunder-NUBench thus serves as a diagnostic tool for semantic negation understanding across model types, while acknowledging language- and domain-specific limitations and the need for multilingual extension.

Abstract

Negation is a fundamental linguistic phenomenon that poses ongoing challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Current benchmarks often treat negation as a minor detail within broader tasks, such as natural language inference. Consequently, there is a lack of benchmarks specifically designed to evaluate comprehension of negation. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly created to assess sentence-level understanding of negation in LLMs. Thunder-NUBench goes beyond merely identifying surface-level cues by contrasting standard negation with structurally diverse alternatives, such as local negation, contradiction, and paraphrase. This benchmark includes manually curated sentence-negation pairs and a multiple-choice dataset, allowing for a comprehensive evaluation of models' understanding of negation.
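
To illustrate why a truth-functional standard negation differs from local negation, the following is a minimal sketch that is not taken from the paper. It assumes, as a simplification, that a two-clause sentence can be modeled as a conjunction `A and B`, that standard negation negates the whole sentence, and that local negation negates only the first clause. The function names are hypothetical and chosen for this illustration only.

```python
from itertools import product


def standard_negation(a: bool, b: bool) -> bool:
    """Neg(A and B): true exactly when the whole sentence is false."""
    return not (a and b)


def local_negation(a: bool, b: bool) -> bool:
    """Assumed simplification: only the first clause is negated, (not A) and B."""
    return (not a) and b


def disagreements() -> list[tuple[bool, bool]]:
    """Truth assignments (A, B) where the two readings take different truth values."""
    return [
        (a, b)
        for a, b in product([False, True], repeat=2)
        if standard_negation(a, b) != local_negation(a, b)
    ]


if __name__ == "__main__":
    for a, b in disagreements():
        print(f"A={a}, B={b}: standard={standard_negation(a, b)}, "
              f"local={local_negation(a, b)}")
```

Under these assumptions the two readings diverge whenever `B` is false, so a sentence with a negated sub-clause is not a valid negation of the original even though it contains a negation cue. This is the kind of distinction the multiple-choice distractors are meant to test.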

Paper Structure

This paper contains 59 sections, 5 figures, and 23 tables.

Figures (5)

  • Figure 1: An example of Thunder-NUBench multiple-choice evaluation task, where the underlined text indicates the main verb phrase of each sentence, and the red text marks the negated part.
  • Figure 2: Dataset generation process.
  • Figure 3: Model performance on Thunder-NUBench. Circles (blue) represent the average performance of 2–3B models, squares (purple) indicate the average for 7–8B models, upward triangles (orange) signify the average of base models, and downward triangles (red) denote the average of instruction-tuned models. Stars (green) represent API models.
  • Figure 4: Model performance on Thunder-NUBench with definition prompt. Circles (blue) represent the average performance of 2–3B models, squares (purple) indicate the average for 7–8B models, upward triangles (orange) signify the average of base models, and downward triangles (red) denote the average of instruction-tuned models. Stars (green) represent API models.
  • Figure 5: Model performance on Thunder-NUBench with detail prompt. Circles (blue) represent the average performance of 2–3B models, squares (purple) indicate the average for 7–8B models, upward triangles (orange) signify the average of base models, and downward triangles (red) denote the average of instruction-tuned models. Stars (green) represent API models.