Generation of Probabilistic Synthetic Data for Serious Games: A Case Study on Cyberbullying

Jaime Pérez, Mario Castro, Edmond Awad, Gregorio López

TL;DR

The paper addresses the need for synthetic data in serious games by proposing a modular simulator that generates probabilistic data for interactive narratives. It combines Bayesian Networks to inject external knowledge with an Item Response Theory–based decision model to simulate agent interactions, demonstrated on a cyberbullying game (RAYUELA). The authors show identifiability and robustness of the generated data through hierarchical Bayesian inference in a BN-informed, two-cluster risk framework, using a 665-person survey to calibrate the model and 500 synthetic players across 15 questions. The approach offers a scalable way to anticipate data modelling, improve privacy and fairness, and accelerate development of serious games while enabling clustering of players by risk propensity. The work provides a concrete architecture and methodology that others can adapt to different serious-game domains and datasets, with potential for real-player validation in future work.

Abstract

Synthetic data generation has been a growing area of research in recent years. However, its potential applications in serious games have not been thoroughly explored. Advances in this field could anticipate data modelling and analysis, as well as speed up the development process. To try to fill this gap in the literature, we propose a simulator architecture for generating probabilistic synthetic data for serious games based on interactive narratives. This architecture is designed to be generic and modular so that it can be used by other researchers on similar problems. To simulate the interaction of synthetic players with questions, we use a cognitive testing model based on the Item Response Theory framework. We also show how probabilistic graphical models (in particular Bayesian networks) can be used to introduce expert knowledge and external data into the simulation. Finally, we apply the proposed architecture and methods in a use case of a serious game focused on cyberbullying. We perform Bayesian inference experiments using a hierarchical model to demonstrate the identifiability and robustness of the generated data.
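The interaction model described in the abstract can be illustrated with a minimal sketch. The paper's exact equation is not reproduced in this summary, so the code below assumes a standard one-parameter (Rasch-style) Item Response Theory form, in which the probability that player $i$ picks the risky option on question $j$ is the logistic function of $\alpha_i - \beta_j$. The function names, the two-cluster Gaussian risk profiles, and the distribution of the question parameters are all hypothetical choices made for illustration; only the counts (500 players, 15 questions) come from the TL;DR.

```python
import math
import random


def response_probability(alpha: float, beta: float) -> float:
    """Assumed Rasch-style form (not taken from the paper):
    logistic of (player risk profile alpha - question threshold beta)."""
    return 1.0 / (1.0 + math.exp(-(alpha - beta)))


def simulate_responses(alphas, betas, seed=0):
    """Draw one binary (dichotomous) answer per player/question pair."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < response_probability(a, b) else 0 for b in betas]
        for a in alphas
    ]


rng = random.Random(42)

# Hypothetical two-cluster risk profiles: a low-risk and a high-risk group.
alphas = [
    rng.gauss(-1.0, 0.5) if rng.random() < 0.5 else rng.gauss(1.0, 0.5)
    for _ in range(500)
]

# Hypothetical question thresholds, one per question.
betas = [rng.gauss(0.0, 1.0) for _ in range(15)]

# responses[i][j] is 1 if synthetic player i chose the risky option on question j.
responses = simulate_responses(alphas, betas)
```

Under this assumed form, a player with a higher $\alpha_i$ is more likely to choose the risky option on every question, which matches the reading of Figure 4 that lower $\alpha$ values encode lower risk propensity.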

Paper Structure (12 sections, 4 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Simulator's overall architecture and components. The generated agents respond to the environment and make decisions according to their profiles in a non-deterministic manner. The blue boxes represent external information that is fed into the simulation. The grey boxes represent the internal states and models of the synthetic agents.
  • Figure 2: Visualisation of the probability $p_{ij}$ as a function of $\alpha$ and $\beta$ for the dichotomous (binary) case. Curves are shown for 7 values of $\alpha$, each in a different colour.
  • Figure 3: Probabilistic model: Bayesian network structure (DAG) that encodes the experts' hypotheses about causal relationships among the variables collected in the survey of minors.
  • Figure 4: Histogram of the risk profile parameters ($\alpha_i$) of the synthetically generated dataset ($N=500$ agents). Lower $\alpha$ values encode agents with lower risk propensity, and higher values encode higher risk propensity.
  • Figure 5: Graphical representation of the hierarchical Bayesian model. Circular nodes represent continuous variables and square nodes discrete ones. Double-bordered nodes represent deterministic variables. Shaded nodes represent observed variables.
  • ...and 5 more figures