Game-MUG: Multimodal Oriented Game Situation Understanding and Commentary Generation Dataset

Zhihao Zhang, Feiqi Cao, Yingbin Mo, Yiran Zhang, Josiah Poon, Caren Han

TL;DR

This work introduces GAME-MUG, a multimodal esports dataset that combines game event logs, caster audio, and audience chats and transcripts to support game-situation understanding and audience-engaged commentary generation. It collects 70k clips from 216 League of Legends matches from 2020–2022, with 3.7M audience chats and 15k annotated game events, and uses Whisper for transcription and GeMAPS for audio features. A joint integration baseline fuses text and audio through multimodal transformers and uses event tokens to guide generation by decoders such as GPT-2 and Pythia, with GPT-4-derived ground-truth annotations. Experiments indicate that multimodal inputs improve event prediction and commentary quality: DeBERTaV3 achieves the best understanding performance, and GPT-2-based generation delivers strong, human-aligned outputs, supporting practical applications in esports broadcasting and analytics.

Abstract

The dynamic nature of esports makes game situations hard to follow for average viewers. Esports broadcasts feature expert casters, but caster commentary alone is not enough to fully convey the game situation. Understanding becomes richer when diverse multimodal esports information is included, such as audience chats and emotions, game audio, and match event information. This paper introduces GAME-MUG, a new multimodal dataset for game situation understanding and audience-engaged commentary generation, together with a strong baseline. The dataset is collected from 2020-2022 LOL (League of Legends) live streams on YouTube and Twitch and includes multimodal game information (text, audio, and time-series event logs) for detecting the game situation. In addition, we propose a new audience-conversation-augmented commentary dataset that covers game situation and audience conversation understanding, and we introduce a robust joint multimodal dual learning model as a baseline. We examine the model's game situation/event understanding ability and its commentary generation capability to show the effectiveness of the multimodal coverage and the joint integration learning approach.

Paper Structure (23 sections, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Visualisation of the keyword analysis, showing the top 15 words for the Kill and Tower events with a time window of 30s. The entities related to each event are remarkably different, such as the skill 'cocoon' for Kill and 'apprehend' for Tower.
  • Figure 2: The concurrent plot for audience chat analysis with the numbers of emotes, emojis, and game events along the same timeline. A positive correlation can be observed between the number of audience chat emojis and the number of game events happening within the same time window.
  • Figure 3: Joint integration framework of Game Situation Understanding and Game Commentary Generation
  • Figure 4: Screenshot of a human evaluation sample. Workers are shown the caster's speech with truncated audience chats on the top left. We provide the generated commentaries on the bottom left. The worker ranks these two commentaries in terms of the inclusion of the game event, coherence, and overall quality. More evaluation samples can be found in Appendix B.
  • Figure 5: Screenshot of the commentary samples.
  • ...and 2 more figures
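
The captions of Figures 1 and 2 describe two time-window analyses: the most frequent chat words around an event, and the relationship between emoji counts and event counts along the same timeline. The following is a minimal sketch of how such analyses could be computed, not the authors' implementation. All function names, the toy data, and the choice to count chats in the 30s following an event are assumptions made for illustration; the paper's exact window definition and tokenisation are not given here.

```python
from collections import Counter
from math import sqrt


def count_per_window(timestamps, window_s, n_windows):
    """Count how many timestamps (in seconds) fall into each fixed-size window."""
    counts = [0] * n_windows
    for t in timestamps:
        idx = int(t // window_s)
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def top_words_near_events(chats, event_times, window_s=30, k=15):
    """Return the k most common words in chats sent within window_s seconds
    after any event (assumption: the window follows the event).

    chats: list of (timestamp_seconds, text) pairs.
    """
    counter = Counter()
    for t, text in chats:
        if any(0 <= t - e <= window_s for e in event_times):
            # Whitespace tokenisation is a simplification for this sketch.
            counter.update(text.lower().split())
    return counter.most_common(k)


# Toy data, invented for illustration only (timestamps in seconds).
emoji_times = [3, 5, 31, 33, 35, 38, 95]
event_times = [32, 36, 92]
chats = [(33, "cocoon hit"), (34, "Cocoon"), (80, "gg")]

emoji_counts = count_per_window(emoji_times, window_s=30, n_windows=4)
event_counts = count_per_window(event_times, window_s=30, n_windows=4)
r = pearson(emoji_counts, event_counts)
top = top_words_near_events(chats, event_times, window_s=30, k=15)
```

A positive value of r on real data would correspond to the relationship that Figure 2 reports between audience emojis and game events within the same time window.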