KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus

Xiaoming Shi, Zeming Liu, Yiming Lei, Chenkai Zhang, Haitao Leng, Chuan Wang, Qingjie Liu, Wanxiang Che, Shaoguo Liu, Size Li, Yunhong Wang

TL;DR

This work tackles the challenge of generating video-driven multilingual mixed-type dialogues across multiple participants. It introduces KwaiChat, a large-scale multilingual mixed-type video dialogue corpus with 93,209 videos and 246,080 dialogues spanning 4 dialogue types, 30 domains, 4 languages, and 13 topics, and benchmarks seven LLMs (including GPT-4o) in zero-shot, few-shot, and fine-tuning settings. The authors propose an adaptive data balancing method to mitigate the long-tail topic distribution and show that, although multimodal models outperform text-only baselines, current models still struggle with this task. The dataset and evaluation provide a resource and baseline for advancing video-driven multilingual mixed-type dialogue systems, with implications for education, collaboration, and multilingual AI research. The future work highlighted includes cross-lingual research, extending coverage to low-resource languages, and safety considerations.

Abstract

Video-based dialogue systems, such as education assistants, have compelling application value and are attracting growing interest. However, current video-based dialogue systems are limited by their reliance on a single dialogue type, which hinders their versatility in practical applications across a range of scenarios, including question answering and emotional dialogue. In this paper, we identify this challenge as how to generate video-driven multilingual mixed-type dialogues. To address this challenge, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still does not perform well on this task, even with the help of in-context learning and fine-tuning, which indicates that the task is not trivial and needs further research.

Paper Structure

This paper contains 23 sections, 2 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: An example of KwaiChat. The image above is captured from a video. Below the video, there are comments in four languages, and a Chinese dialogue is shown, with annotated topics and corresponding dialogue types.
  • Figure 2: Domains of KwaiChat.
  • Figure 3: The domains, languages, topics, and dialogue types of KwaiChat. The first column lists the domains. The second column lists the four languages. The third column lists the topics. The fourth column lists the dialogue types.
  • Figure 4: Examples of KwaiChat.