Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches

Anum Afzal, Yuki Saito, Hiroya Takamura, Katsuhito Sudoh, Shinnosuke Takamichi, Graham Neubig, Florian Matthes, Tatsuya Ishigaki

TL;DR

This work investigates whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed, and proposes two prompting-based decoding strategies: a fixed-interval approach and a novel dynamic interval-based decoding approach.

Abstract

Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
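
The difference between the two decoding strategies can be illustrated with a small scheduling sketch. This is not the authors' implementation. It assumes that the model is queried on a grid of $t$-second steps, that an empty string stands for a pause, and that the dynamic variant skips $k$ steps after an utterance so the next query falls after the utterance would have finished. The names `estimate_duration`, `schedule_commentary`, and `generate`, and the character-rate duration estimate, are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Tuple


def estimate_duration(utterance: str, chars_per_second: float = 15.0) -> float:
    """Hypothetical duration estimate: character count over an assumed speaking rate."""
    return len(utterance) / chars_per_second


def schedule_commentary(
    video_length: float,
    t: float,
    generate: Callable[[float], str],
    dynamic: bool = True,
) -> List[Tuple[float, str]]:
    """Return (timestamp, utterance) pairs; an empty utterance is treated as a pause."""
    results: List[Tuple[float, str]] = []
    now = 0.0
    while now < video_length:
        utterance = generate(now)  # stand-in for prompting an MLLM at time `now`
        k = 0  # fixed-interval decoding always keeps k = 0
        if utterance:
            results.append((now, utterance))
            if dynamic:
                # Assumption: skip the grid steps covered by the previous utterance.
                k = max(0, math.ceil(estimate_duration(utterance) / t) - 1)
        now += (k + 1) * t
    return results


def fake_model(timestamp: float) -> str:
    """Placeholder generator: always returns a 30-character utterance."""
    return "a" * 30


fixed = schedule_commentary(6.0, 1.0, fake_model, dynamic=False)
dynamic = schedule_commentary(6.0, 1.0, fake_model, dynamic=True)
print([ts for ts, _ in fixed])
print([ts for ts, _ in dynamic])
```

With the placeholder generator, each utterance is estimated at 2 seconds, so the dynamic schedule queries half as often as the fixed one. In a real system, `generate` would wrap a call to a multimodal model with the recent frames and previous commentary in the prompt.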

Paper Structure (26 sections, 3 figures, 10 tables)

Figures (3)

  • Figure 1: An example of automatically generated commentary for a racing game shown as a subtitle on video.
  • Figure 2: Illustration of the Fixed Interval-based and Dynamic Interval-based decoding strategies, both queried at uniform intervals of $t$ seconds. For the Fixed Interval-based strategy, $k$ is fixed at 0.
  • Figure 3: Contextual similarities between LLM-generated and reference commentaries, computed over 10% segments of the whole video.