KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus

Xiaoming Shi; Zeming Liu; Yiming Lei; Chenkai Zhang; Haitao Leng; Chuan Wang; Qingjie Liu; Wanxiang Che; Shaoguo Liu; Size Li; Yunhong Wang

TL;DR

This work tackles the challenge of generating video-driven multilingual mixed-type dialogues across multiple participants. It introduces KwaiChat, a large-scale corpus of video-driven multilingual mixed-type dialogues with 93,209 videos and 246,080 dialogues spanning 4 dialogue types, 30 domains, 4 languages, and 13 topics, and benchmarks seven LLMs (including GPT-4o) in zero-shot, few-shot, and fine-tuning settings. The authors propose an adaptive data balancing method to mitigate the long-tail topic distribution and show that, although multimodal models outperform text-only baselines, current models still struggle with this task. The dataset and evaluation provide a resource and baseline for advancing video-driven multilingual mixed-type dialogue systems, with implications for education, collaboration, and multilingual AI research. Future work highlighted includes cross-lingual research, extending coverage to low-resource languages, and safety considerations.

Abstract

Video-based dialogue systems, such as education assistants, have compelling application value and are attracting growing interest. However, current video-based dialogue systems rely on a single dialogue type, which limits their versatility in practical applications across scenarios such as question answering and emotional dialogue. In this paper, we frame this challenge as how to generate video-driven multilingual mixed-type dialogues. To address it, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still does not perform well on this task, even with the help of in-context learning and fine-tuning, which indicates that the task is non-trivial and needs further research.
