MuTual: A Dataset for Multi-Turn Dialogue Reasoning

Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, Ming Zhou

TL;DR

MuTual is a challenging multi-turn dialogue reasoning dataset derived from English listening comprehension exams for Chinese students, designed to test reasoning rather than surface matching. The authors detail the dataset construction, motivate the MuTualPlus extension, which adds safe responses, and report extensive evaluations showing that even strong pre-trained models fall well short of human-level reasoning. RoBERTa-MC is the top performer yet remains far below human performance, and MuTualPlus is more difficult still, underscoring the need for stronger reasoning capabilities in chat systems. The work also analyzes reasoning types, context dependence, and transfer limitations, with the aim of spurring progress on robust, reasoning-aware dialogue agents.

Abstract

Non-task-oriented dialogue systems have achieved great success in recent years, owing to widely accessible conversation data and the development of deep learning techniques. Given a context, current systems can produce a relevant and fluent response, but they sometimes make logical mistakes because of weak reasoning capabilities. To facilitate research on conversational reasoning, we introduce MuTual, a novel dataset for Multi-Turn dialogue Reasoning, consisting of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams. Compared to previous benchmarks for non-task-oriented dialogue systems, MuTual is much more challenging, since it requires a model that can handle various reasoning problems. Empirical results show that state-of-the-art methods reach only 71%, far behind the human performance of 94%, indicating that there is ample room for improving reasoning ability. MuTual is available at https://github.com/Nealcly/MuTual.

Paper Structure

This paper contains 13 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: B is incorrect because there is no reason to apologize. C and D can be excluded because, based on the context, the relationship between the two speakers is that of waiter and customer.
  • Figure 2: The process of modifying the listening comprehension test data.
  • Figure 3: Examples from the MuTual dataset. All choices are relevant to the context, but only one of them is logically correct. Some negative choices might be reasonable in extreme cases, but the positive one is the most appropriate. Clue words are shown in purple and underlined. More examples are shown in Appendix A.
  • Figure 4: BERT-MC and RoBERTa-MC performance on different reasoning types.
  • Figure 5: Error analysis. ✘ indicates RoBERTa-MC's prediction.
  • ...and 1 more figure