Game-MUG: Multimodal Oriented Game Situation Understanding and Commentary Generation Dataset
Zhihao Zhang, Feiqi Cao, Yingbin Mo, Yiran Zhang, Josiah Poon, Caren Han
TL;DR
This work introduces GAME-MUG, a multimodal esports dataset that combines game event logs, caster audio, and audience chats/transcripts to support game-situation understanding and audience-engaged commentary generation. It contains 70k clips from 216 League of Legends matches played in 2020–2022, with 3.7M audience chats and 15k annotated game events; transcripts are produced with Whisper and audio features are extracted with GeMAPS. A joint integration baseline fuses text and audio through multimodal transformers and uses event tokens to guide generation by decoders such as GPT-2 and Pythia, with ground-truth annotations derived from GPT-4. Experiments show that multimodal inputs improve both event prediction and commentary quality: DeBERTaV3 achieves the best understanding performance, and GPT-2-based generation produces outputs that align well with human judgments, supporting practical applications in esports broadcasting and analytics.
Abstract
The dynamic nature of esports makes the game situation difficult for average viewers to follow. Esports broadcasts feature expert casters, but caster commentary alone is not enough to fully convey the game situation. The picture becomes richer when diverse multimodal esports information is included, such as audience chats and emotions, game audio, and match event information. This paper introduces GAME-MUG, a new multimodal dataset for game situation understanding and audience-engaged commentary generation, together with a strong baseline. The dataset is collected from 2020-2022 League of Legends (LoL) live streams on YouTube and Twitch and includes multimodal esports game information (text, audio, and time-series event logs) for detecting the game situation. We also propose a new audience-conversation-augmented commentary dataset that covers both game situation and audience conversation understanding, and we introduce a robust joint multimodal dual learning model as a baseline. We examine the model's game situation/event understanding ability and its commentary generation capability to show the effectiveness of the multimodal coverage and the joint integration learning approach.
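To make the data layout more concrete, the following is a minimal Python sketch of how timestamped audience chats and event logs could be grouped into clips, and how event tokens might be prefixed to a decoder input. It is an illustration only, not the authors' pipeline: the `Clip` structure, the `build_clips` and `to_prompt` helpers, the 10-second clip length, the `<kill>` style token format, and the sample data are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Clip:
    """One fixed-length window of a match (hypothetical structure)."""
    index: int
    chats: list = field(default_factory=list)
    events: list = field(default_factory=list)


def build_clips(chats, events, clip_seconds=10.0):
    """Group (timestamp, text) chats and (timestamp, label) events into clips.

    The clip length is an assumed parameter, not a value taken from the paper.
    """
    clips = {}

    def clip_for(timestamp):
        # Map a timestamp in seconds to the clip window that contains it.
        idx = int(timestamp // clip_seconds)
        if idx not in clips:
            clips[idx] = Clip(index=idx)
        return clips[idx]

    for timestamp, text in chats:
        clip_for(timestamp).chats.append(text)
    for timestamp, label in events:
        clip_for(timestamp).events.append(label)

    # Return clips in chronological order.
    return [clips[i] for i in sorted(clips)]


def to_prompt(clip):
    """Prefix event tokens so a text decoder could condition on the game situation."""
    event_tokens = " ".join(f"<{label}>" for label in clip.events) or "<no_event>"
    return f"{event_tokens} {' | '.join(clip.chats)}"


# Hypothetical sample data: timestamps are seconds from the start of a match.
sample_chats = [(1.0, "gg"), (12.5, "nice kill"), (13.0, "wow")]
sample_events = [(12.0, "kill")]
sample_clips = build_clips(sample_chats, sample_events, clip_seconds=10.0)
```

With the sample data, the chats fall into two clips, and only the second clip carries an event token in its prompt. In a real setup the prompt string would be tokenized and passed to a decoder such as GPT-2 alongside the audio features.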
