JMultiWOZ: A Large-Scale Japanese Multi-Domain Task-Oriented Dialogue Dataset

Atsumoto Ohashi, Ryu Hirai, Shinya Iizuka, Ryuichiro Higashinaka

TL;DR

This paper presents JMultiWOZ, the first large-scale Japanese multi-domain task-oriented dialogue dataset, which enables benchmarked research on dialogue state tracking and response generation in Japanese. It details a from-scratch data collection pipeline (ontology, backend database, user goals, Wizard-of-Oz (WOZ) dialogue collection, and full dialogue-state annotation) together with the quality controls applied to it. Benchmark results show that JMultiWOZ provides a benchmark on par with the English MultiWOZ2.2, while interactive evaluations with human participants reveal limitations in the task completion capabilities of current LLMs in Japanese. The dataset is intended to advance Japanese dialogue systems and support multilingual task-oriented dialogue research; future work includes broader model support and dialogue act (DA) annotations.

Abstract

Dialogue datasets are crucial for deep learning-based task-oriented dialogue system research. While numerous English language multi-domain task-oriented dialogue datasets have been developed and contributed to significant advancements in task-oriented dialogue systems, such a dataset does not exist in Japanese, and research in this area is limited compared to that in English. In this study, towards the advancement of research and development of task-oriented dialogue systems in Japanese, we constructed JMultiWOZ, the first Japanese language large-scale multi-domain task-oriented dialogue dataset. Using JMultiWOZ, we evaluated the dialogue state tracking and response generation capabilities of the state-of-the-art methods on the existing major English benchmark dataset MultiWOZ2.2 and the latest large language model (LLM)-based methods. Our evaluation results demonstrated that JMultiWOZ provides a benchmark that is on par with MultiWOZ2.2. In addition, through evaluation experiments of interactive dialogues with the models and human participants, we identified limitations in the task completion capabilities of LLMs in Japanese.

Paper Structure (33 sections, 13 figures, 9 tables)

Figures (13)

  • Figure 1: An example of dialogue across two domains: restaurants and taxis. The gray and green message bubbles represent the utterances of the user and the wizard, respectively. The red and blue boxes indicate the annotation of the dialogue state and the database results, respectively. The bubble for each utterance contains both the original Japanese utterance and its English translation by the authors.
  • Figure 2: Web UI of the wizard for dialogue collection: (A) a search form for an entity from the backend database, (B) an interface displaying detailed information about the selected entity and a reservation form for the entity, and (C) an interface for chatting with the user.
  • Figure 3: The distribution of dialogue lengths, divided into dialogues containing only one domain (single-domain) and dialogues containing two or more domains (multi-domain).
  • Figure 4: T5 pipeline (Bang et al., 2023). (1) First, the dialogue state is predicted from the given dialogue context, and (2) the result is added to the input to generate the final response. Both (1) and (2) are performed by the same model.
  • Figure 5: LLM pipeline (Hudeček and Dušek, 2023) in the zero-shot setting. First, (1) the current active domain is estimated from the dialogue context. Next, (2) the dialogue state is tracked, and (3) the response is generated for that domain using a prompt focused on that domain.
  • ...and 8 more figures
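
The three-step zero-shot LLM pipeline described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the prompt wording, the `slot=value` output format, and the two-domain list (borrowed from the Figure 1 example) are all hypothetical, and the database lookup that a full system would perform between state tracking and response generation is omitted.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]

# Illustrative subset of domains, taken from the Figure 1 example.
DOMAINS = ["restaurant", "taxi"]


def detect_domain(llm: LLM, context: List[str]) -> str:
    """Step (1): estimate the currently active domain from the dialogue context."""
    prompt = (
        "Domains: " + ", ".join(DOMAINS) + "\n"
        "Dialogue:\n" + "\n".join(context) + "\n"
        "Active domain:"
    )
    answer = llm(prompt).strip().lower()
    # Fall back to the first domain if the model output is not a known domain.
    return answer if answer in DOMAINS else DOMAINS[0]


def track_state(llm: LLM, context: List[str], domain: str) -> Dict[str, str]:
    """Step (2): track the dialogue state with a prompt focused on one domain."""
    prompt = (
        f"[{domain}] List slot=value pairs, one per line.\n"
        "Dialogue:\n" + "\n".join(context) + "\n"
        "State:"
    )
    state: Dict[str, str] = {}
    for line in llm(prompt).splitlines():
        # Assumed output format: lines without '=' are ignored.
        if "=" in line:
            slot, value = line.split("=", 1)
            state[slot.strip()] = value.strip()
    return state


def generate_response(
    llm: LLM, context: List[str], domain: str, state: Dict[str, str]
) -> str:
    """Step (3): generate the response with a domain-focused prompt."""
    state_text = ", ".join(f"{slot}={value}" for slot, value in state.items())
    prompt = (
        f"[{domain}] State: {state_text}\n"
        "Dialogue:\n" + "\n".join(context) + "\n"
        "Response:"
    )
    return llm(prompt).strip()


def run_turn(llm: LLM, context: List[str]) -> Dict[str, object]:
    """Run steps (1) to (3) for one turn and return all intermediate results."""
    domain = detect_domain(llm, context)
    state = track_state(llm, context, domain)
    response = generate_response(llm, context, domain, state)
    return {"domain": domain, "state": state, "response": response}
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM API; a real system would substitute its own client and prompt templates.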