Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding

Yeonkyoung So; Gyuseong Lee; Sungmok Jung; Joonhak Lee; JiA Kang; Sangho Kim; Jaejin Lee

TL;DR

Thunder-NUBench introduces a sentence-level negation benchmark that formalizes a truth-functional standard negation operator $Neg(\cdot)$ and contrasts it with local negation, contradiction, and paraphrase to probe semantic reasoning in LLMs. The dataset comprises manually curated standard negations and a four-option MCQ evaluation built from English sources (HoVer and Wikipedia), with multi-stage review and explicit data-generation guidelines. Empirical results across model families (2–3B, 7–8B, and API models) show that few-shot prompting and supervised fine-tuning with LoRA both improve performance, yet models frequently confuse local negation with standard negation, especially in complex sentence structures. Thunder-NUBench thus provides a diagnostic tool for semantic negation understanding that can guide targeted improvements in reasoning across model types, while acknowledging language- and domain-specific limitations and the need for multilingual extension.

Abstract

Negation is a fundamental linguistic phenomenon that poses ongoing challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Current benchmarks often treat negation as a minor detail within broader tasks, such as natural language inference. Consequently, there is a lack of benchmarks specifically designed to evaluate comprehension of negation. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly created to assess sentence-level understanding of negation in LLMs. Thunder-NUBench goes beyond merely identifying surface-level cues by contrasting standard negation with structurally diverse alternatives, such as local negation, contradiction, and paraphrase. This benchmark includes manually curated sentence-negation pairs and a multiple-choice dataset, allowing for a comprehensive evaluation of models' understanding of negation.
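
To make the contrast between standard negation and local negation concrete, the following is a minimal sketch rather than the benchmark's formal definition. It assumes a compound sentence can be modeled as the conjunction of two clauses, A and B, and treats standard negation as truth-functional: it must flip the truth value of the whole sentence in every situation. The function names (`sentence`, `standard_negation`, `local_negation`, `is_standard_negation`) are hypothetical and chosen only for illustration.

```python
from itertools import product

# Assumption: the original sentence is modeled as "clause A and clause B".
def sentence(a: bool, b: bool) -> bool:
    return a and b

# Standard negation negates the whole sentence: not (A and B).
def standard_negation(a: bool, b: bool) -> bool:
    return not sentence(a, b)

# Local negation negates only one clause: (not A) and B.
def local_negation(a: bool, b: bool) -> bool:
    return (not a) and b

def is_standard_negation(candidate) -> bool:
    """A candidate is a standard negation only if its truth value is the
    opposite of the original sentence under every truth assignment."""
    return all(
        candidate(a, b) != sentence(a, b)
        for a, b in product([True, False], repeat=2)
    )

# Truth assignments where the local negation fails to flip the truth value.
same_value_cases = [
    (a, b)
    for a, b in product([True, False], repeat=2)
    if local_negation(a, b) == sentence(a, b)
]

print(is_standard_negation(standard_negation))
print(is_standard_negation(local_negation))
print(same_value_cases)
```

Under this toy model, the local negation agrees with the original sentence whenever clause B is false, so it is not a negation of the sentence as a whole, even though it contains an explicit negation cue. This is the kind of distinction that a surface-level cue detector would miss.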
