Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth

Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin

TL;DR

This work introduces Drivelology and the DrivelHub benchmark to probe whether LLMs can grasp deep pragmatic and culturally embedded meanings beyond surface coherence. By defining a taxonomy of five Drivelology categories and four evaluation tasks across six languages, the authors systematically test detection, tagging, narrative explanation, and narrative selection. Across a suite of models evaluated zero-shot, results reveal a gap between fluent text generation and genuine interpretive understanding, which widens on hard reasoning tasks and in cross-lingual settings. The findings argue for targeted training paradigms and evaluation frameworks that explicitly address multi-layered social reasoning, with practical implications for safer and more creatively capable AI systems.

Abstract

We introduce Drivelology, a unique linguistic phenomenon characterised as "nonsense with depth": utterances that are syntactically coherent yet pragmatically paradoxical, emotionally loaded, or rhetorically subversive. While such expressions may resemble surface-level nonsense, they encode implicit meaning requiring contextual inference, moral reasoning, or emotional interpretation. We find that current large language models (LLMs), despite excelling at many natural language processing (NLP) tasks, consistently fail to grasp the layered semantics of Drivelological text. To investigate this, we construct a benchmark dataset of over 1,200 meticulously curated and diverse examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each example underwent careful expert review to verify its Drivelological characteristics, involving multiple rounds of discussion and adjudication to address disagreements. Using this dataset, we evaluate a range of LLMs on classification, generation, and reasoning tasks. Our results reveal clear limitations of LLMs: models often confuse Drivelology with shallow nonsense, produce incoherent justifications, or miss implied rhetorical functions altogether. These findings highlight a deep representational gap in LLMs' pragmatic understanding and challenge the assumption that statistical fluency implies cognitive comprehension. We release our dataset and code to facilitate further research in modelling linguistic depth beyond surface-level coherence.

Paper Structure

This paper contains 28 sections, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Overview of the multi-stage process for constructing the DrivelHub dataset.
  • Figure 2: Overview of the Drivelology evaluation framework for LLMs. The figure illustrates four core tasks designed to systematically assess LLMs' ability to understand and reason about Drivelology: Drivelology Detection (binary classification), Drivelology Tagging (multi-label classification), Implicit Narrative Writing (generative reasoning), and Narrative Selection (multiple-choice question answering with both Easy and Hard settings).
  • Figure 3: Model performance on the multilingual DrivelHub dataset, contrasted by prompt language (English vs. Mandarin). Each reported score is the average performance over three distinct prompts to minimise variance.
  • Figure 4: A language-based breakdown of Narrative Selection (MCQA) accuracy from the main results table. The charts disaggregate the overall Easy and Hard accuracy scores based on the original language of the Drivelology sample.
  • Figure 5: UpSet plot (Lex et al., 2014) illustrating the overlap and intersection sizes among Drivelology categories. Each vertical bar represents the number of samples belonging to a specific combination of categories, as indicated by the connected black dots below. Categories include Misdirection, Paradox, Switchbait, Inversion, and Wordplay.
  • ...and 7 more figures
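
The intersection sizes described in the Figure 5 caption can be illustrated with a minimal Python sketch. It assumes each sample carries a list of category tags (multi-label); the function name `intersection_sizes` and the toy samples are hypothetical and are not taken from DrivelHub or the authors' code. Only the five category names come from the caption.

```python
from collections import Counter

# Category names as listed in the Figure 5 caption.
CATEGORIES = ("Misdirection", "Paradox", "Switchbait", "Inversion", "Wordplay")


def intersection_sizes(samples):
    """Count how many samples carry each exact combination of categories.

    `samples` is an iterable of tag lists, one list per sample.
    This is an illustrative helper, not part of the released code.
    """
    counts = Counter()
    for tags in samples:
        unknown = set(tags) - set(CATEGORIES)
        if unknown:
            raise ValueError(f"unknown categories: {sorted(unknown)}")
        # A frozenset is hashable and ignores tag order, so
        # ["Paradox", "Wordplay"] and ["Wordplay", "Paradox"] match.
        counts[frozenset(tags)] += 1
    return counts


# Hypothetical toy labels for illustration only.
toy_samples = [
    ["Paradox"],
    ["Paradox", "Wordplay"],
    ["Wordplay", "Paradox"],
    ["Misdirection"],
]

sizes = intersection_sizes(toy_samples)
for combo, n in sizes.most_common():
    print(sorted(combo), n)
```

Each key of `sizes` corresponds to one column of connected dots in an UpSet plot, and its count to the height of the bar above that column.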