Can Large Language Models Capture Video Game Engagement?

David Melhart, Matthew Barthet, Georgios N. Yannakakis

TL;DR

The paper investigates whether pretrained large language models can detect time-continuous viewer engagement from gameplay videos in a multimodal setting. By comparing open-source LLaVA variants and GPT-4o across text and image inputs with Chain-of-Thought prompting, the authors evaluate engagement changes between consecutive frames on the GameVibe-LLM subset (80 minutes, 20 games). They report that, while LLMs show human-like reasoning, their predictions of continuous engagement largely lag human annotations, with performance heavily dependent on the game, input modality, and prompting strategy; GPT-4o with multimodal few-shot prompting yields the strongest gains, up to about 47% relative improvement on some games and an average around 6% across games. The study discusses factors behind the gaps, such as visual readability and model priors, and outlines a roadmap including direct video inputs, memory, and retrieval-augmented approaches to improve automated affect labelling in dynamic media. Overall, the work establishes a baseline for LLM-based viewer engagement annotation and motivates future research toward richer multimodal integration and larger, more diverse affective datasets.

Abstract

Can out-of-the-box pretrained Large Language Models (LLMs) successfully detect human affect when observing a video? To address this question, we comprehensively evaluate, for the first time, the capacity of popular LLMs to annotate and successfully predict continuous affect annotations of videos when prompted with a sequence of text and video frames in a multimodal fashion. In particular, we test LLMs' ability to correctly label changes of in-game engagement in 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 2,400 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains, they generally fall short of capturing the continuous experience annotations provided by humans. We examine some of the underlying causes for the relatively poor overall performance, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.

Paper Structure

This paper contains 23 sections, 1 equation, 16 figures, 1 table.

Figures (16)

  • Figure 1: Clips in the GameVibe Dataset. List of game titles: (1) Apex Legends; (2) Blitz Brigade; (3) Borderlands 3; (4) Corridor 7; (5) Counter Strike 1.6; (6) CS:GO - Dust2; (7) CS:GO - Office; (8) Doom; (9) Insurgency; (10) Far Cry; (11) Fortnite; (12) Heretic; (13) Medal of Honor 2010; (14) Overwatch 2; (15) PUBG; (16) Medal of Honor 1999; (17) Team Fortress 2; (18) Void Bastards; (19) HROT; (20) Wolfram.
  • Figure 2: Overview of the evaluation experiments presented in this study. Independently of experimental setting, the downstream task is engagement prediction formulated as a binary preference. We use a combination of text prompts and/or video frames as input and task the LLMs to label engagement. To evaluate the models, we compare the generated labels to the ground truth labels from the annotated GameVibe corpus (see Section \ref{sec:data}). All LLMs are prompted with a Chain-of-Thought (CoT) strategy. In the Text Input setup, the input for the downstream task is text descriptions (see Section \ref{sec:methods:text}) whereas in the Multimodal settings, the input contains both images and text prompts (see Section \ref{sec:methods:visual}). In the few-shot experiments we generate reasoning examples based on ground truth evaluations. The examples are given to the LLM in addition to the base CoT prompt and the images (see Section \ref{sec:methods:prompts}). In all experimental settings we generate a description, comparison, reasoning, and a decision relating to an increase or decrease in engagement. We parse these outputs to derive the final binary engagement evaluation. $^\ast$Descriptions are only generated in the Multimodal Input settings.
  • Figure 3: Example clip from GameVibe showcasing the annotation interface using PAGAN and the RankTrace annotation tool for collecting unbounded, time continuous signals in real-time.
  • Figure 4: Application of the temporal shift ($\Delta t$) hyperparameter to the ground truth. The top red bar (Vision Input) shows an example of individual frames extracted from the gameplay video at a 3-second interval. The bottom green bar (Ground Truth) shows a $\Delta t$ of $-2$ seconds, which means that each window aggregates information $2$ seconds before and $1$ second after the corresponding video frame.
  • Figure 5: Sensitivity analysis across hyperparameters $\Delta t$ and $\theta$. The table presents $\Delta A$ values (relative gain in accuracy). $\Delta t$ is the relative shift of the time window to the frame, and $\theta$ is the binary threshold for the split criterion (i.e., increasing or decreasing engagement). The last column shows average $\Delta A$ across all games.
  • ...and 11 more figures
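
The captions of Figures 4 and 5 describe two ground truth hyperparameters: the temporal shift $\Delta t$ of the aggregation window relative to each frame, and the threshold $\theta$ used to split changes into increasing or decreasing engagement. The sketch below illustrates how these could interact; it is not the authors' implementation. The function names, the use of the mean as the aggregator, the 1 Hz sampling of the trace, and the choice to mark changes no larger than $\theta$ as ambiguous (`None`) are all assumptions made for illustration.

```python
from statistics import mean


def window_means(trace, frame_times, delta_t=-2.0, window=3.0, rate_hz=1.0):
    """Aggregate the annotation trace around each frame time.

    For a frame at time t, the window covers [t + delta_t, t + delta_t + window).
    With delta_t = -2 and window = 3 this spans 2 s before and 1 s after the frame.
    Using the mean as the aggregator is an assumption of this sketch.
    """
    values = []
    for t in frame_times:
        start = t + delta_t
        end = start + window
        # Sample i of the trace is assumed to be taken at time i / rate_hz.
        samples = [v for i, v in enumerate(trace) if start <= i / rate_hz < end]
        values.append(mean(samples) if samples else None)
    return values


def preference_labels(values, theta=0.0):
    """Label each pair of consecutive windows as a binary preference.

    1 = engagement increased, 0 = engagement decreased,
    None = change not larger than theta (treated as ambiguous here, an assumption).
    """
    labels = []
    for prev, curr in zip(values, values[1:]):
        if prev is None or curr is None or abs(curr - prev) <= theta:
            labels.append(None)
        else:
            labels.append(1 if curr > prev else 0)
    return labels


# Hypothetical engagement trace sampled once per second (made-up values).
trace = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
frame_times = [3.0, 6.0, 9.0]  # frames extracted every 3 seconds

windows = window_means(trace, frame_times, delta_t=-2.0)
labels = preference_labels(windows, theta=0.5)
```

Comparing such labels against a model's parsed increase/decrease decisions gives a per-game accuracy. A larger $\theta$ discards more small changes in this sketch, which is one way the threshold could affect how many frame pairs are evaluated.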