MarioGPT: Open-Ended Text2Level Generation through Large Language Models

Shyam Sudhakaran, Miguel González-Duque, Claire Glanois, Matthias Freiberger, Elias Najarro, Sebastian Risi

TL;DR

MarioGPT is a prompt-conditioned Transformer for open-ended text-to-level generation in Super Mario Bros. It fuses a prompt signal from a frozen BART encoder with a DistilGPT2-based level predictor, which enables text-guided level design and, when paired with novelty search, sustains diverse, playable content in an open-ended loop. Reported results include high non-air tile accuracy (~93% training, ~91% validation), a playability rate of about 88%, and meaningful prompt controllability, alongside memorization and prompt-following limitations. The work points to a practical path toward flexible PCG systems built on large language models, with possible extensions through richer data and human-in-the-loop feedback.

Abstract

Procedural Content Generation (PCG) is a technique to generate complex and diverse environments in an automated way. However, while generating content with PCG methods is often straightforward, generating meaningful content that reflects specific intentions and constraints remains challenging. Furthermore, many PCG algorithms lack the ability to generate content in an open-ended manner. Recently, Large Language Models (LLMs) have been shown to be incredibly effective in many diverse domains. These trained LLMs can be fine-tuned, reusing information and accelerating training for new tasks. Here, we introduce MarioGPT, a fine-tuned GPT2 model trained to generate tile-based game levels, in our case Super Mario Bros levels. MarioGPT can not only generate diverse levels but can also be text-prompted for controllable level generation, addressing one of the key challenges of current PCG techniques. As far as we know, MarioGPT is the first text-to-level model, and combined with novelty search, it enables the generation of diverse levels with varying play-style dynamics (i.e., player paths) and the open-ended discovery of an increasingly diverse range of content. Code available at https://github.com/shyamsn97/mario-gpt.

Paper Structure

This paper contains 15 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: MarioGPT successfully generates levels that follow the text prompt (a-e). Failure cases are rare: for example, in (f) the model generates many pipes and some blocks, but it still generates enemies even though it was prompted with "no enemies".
  • Figure 2: MarioGPT prediction pipeline. Our MarioGPT model is a fine-tuned version of the distilled GPT2 language model. Like GPT2, MarioGPT is trained to predict next-token sequences. Levels are represented as strings, which are tokenized with a Byte-Pair Encoding tokenizer, similar to the original GPT2 model. The level is split by columns and flattened into a single vector (or a batch of vectors for multiple levels). To incorporate prompt information, we use a frozen text encoder in the form of a pretrained bidirectional LLM (BART) and output the average hidden states of the model's forward pass. This average hidden state is then used in the cross-attention layers of the GPT2 architecture in combination with the actual level sequence being passed into the model.
  • Figure 3: Novelty search setup and MarioGPT mutation operators. A level is sampled from a set of top elites in the archive, mutated, and, if novel enough, added to the archive. The mutation process involves two main steps: (1) Pick a random slice from the level and replace it with a new MarioGPT sample, using a random prompt. (2) Inpaint the border region with MarioBert to preserve path consistency.
  • Figure 4: Novelty search behavior characteristic. Left: level, Right: smoothed moving average of generated path.
  • Figure 5: A* vs. MarioGPT generated paths. Levels with (a) minimum (0.02), (b) median (0.89), and (c) maximum (11.0) mean absolute error (MAE) between the trajectory of the actual A* agent (denoted as A) and the model's suggestion (denoted as P), as well as interesting hand-picked examples. Positions where both trajectories overlap are marked with *. Paths suggested by the model generally tend to have more airtime than those of the A* agent (d, e), likely because game physics are not accounted for in the original path annotations of the training data.
  • ...and 6 more figures
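
The captions for Figures 2, 4, and 5 mention three small computations: flattening a level column by column, smoothing a path with a moving average, and scoring two trajectories with mean absolute error. The sketch below illustrates each one in plain Python. It is not taken from the MarioGPT repository: the function names, the tile characters, the top-to-bottom order within a column, the trailing window, and the representation of a path as one height value per column are all assumptions made for illustration.

```python
from typing import List, Sequence


def flatten_by_columns(rows: Sequence[str]) -> str:
    """Read a tile grid column by column and join the tiles into one string.

    Assumption: each column is read top to bottom; the caption only says
    the level is "split by columns and flattened".
    """
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    return "".join(rows[y][x] for x in range(width) for y in range(len(rows)))


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; the first entries average over fewer points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def path_mae(agent: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between two equal-length height sequences."""
    if len(agent) != len(predicted) or len(agent) == 0:
        raise ValueError("paths must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(agent, predicted)) / len(agent)


if __name__ == "__main__":
    # Illustrative 3x2 level: '-' air, 'E' enemy, 'X' ground (hypothetical tiles).
    level = ["--", "-E", "XX"]
    print(flatten_by_columns(level))
    print(moving_average([1, 2, 3, 4], window=2))
    print(path_mae([1, 2, 3], [1, 4, 3]))
```

Reading column by column keeps vertically adjacent tiles close together in the token sequence, which suits a left-to-right autoregressive model that generates a level one column at a time.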