BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Chuyuan Li, Giuseppe Carenini

TL;DR

BeDiscovER is a comprehensive, multilingual benchmark suite for assessing discourse understanding in modern reasoning-oriented LLMs across five tasks and 52 datasets. It uses QA-style prompting to evaluate lexical, sentential, and document-level discourse phenomena, including novel challenges like discourse particle disambiguation and dialogue parsing. Across open-source models (Qwen3, DeepSeek-R1) and GPT-5-mini, results show strong performance on the arithmetic aspect of temporal reasoning but persistent difficulties with full-document reasoning and subtle discourse relations, pointing to gaps that future discourse-aware training could address. The benchmark serves as a practical, unified resource to diagnose, compare, and guide improvements in the discourse understanding of large language models.

Abstract

We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks spanning the discourse-lexicon, (multi-)sentential, and document levels, with 52 individual datasets in total. It covers extensively studied tasks such as discourse parsing and temporal relation extraction, as well as novel challenges such as discourse particle disambiguation (e.g., "just"), and it also aggregates a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER. We find that state-of-the-art models perform strongly on the arithmetic aspect of temporal reasoning but struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

Paper Structure

This paper contains 42 sections, 11 figures, 41 tables.

Figures (11)

  • Figure 1: Performance comparison on Task (1) Discourse Marker Understanding on the Just-Manual dataset with different reasoning modes or effort levels.
  • Figure 2: Performance comparison on Task (1) Discourse Marker Understanding on the Just-Subtitle dataset with different reasoning modes or effort levels.
  • Figure 3: Performance comparison on Task (1) Discourse Marker Understanding on the Otherwise dataset with different reasoning modes or effort levels.
  • Figure 4: Performance comparison on Task (2) Temporal Reasoning on the TBD dataset with different reasoning modes or effort levels.
  • Figure 5: Performance comparison on Task (2) Temporal Reasoning on the TDD-Man dataset with different reasoning modes or effort levels.
  • ...and 6 more figures