The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos
Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira
TL;DR
The paper introduces NES-VMDB, a large dataset pairing 98,940 NES gameplay clips with their original soundtracks in MIDI format to enable symbolic music generation from video data. The dataset is assembled by collecting long-play videos, segmenting them into 15-second clips, and using a Shazam-like audio fingerprinting approach to map each clip to a synthesized MIDI piece from the NES-MDB. A Controllable Music Transformer (CMT) baseline is trained to generate NES music conditioned on gameplay rhythm features, and its outputs are compared with those of an unconditional CMT and with human-composed pieces using objective music-structure metrics. A genre classifier tests whether the conditioned music captures genre signals, showing partial success in aligning generated pieces with game genres. Overall, the work demonstrates the feasibility of video-conditioned symbolic music generation for retro games and notes that further research is needed to reach human-level performance.
Abstract
Neural models are among the most popular approaches for music generation, yet there are no standard large datasets tailored for learning music directly from game data. To address this research gap, we introduce a novel dataset named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), which encompasses 5,278 music pieces from 397 NES games. Our approach involves collecting long-play videos for 389 games of the original dataset, slicing them into 15-second clips, and extracting the audio from each clip. Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in the NES-MDB dataset. Additionally, we introduce a baseline method based on the Controllable Music Transformer (CMT) to generate NES music conditioned on gameplay clips. We evaluated this approach with objective metrics, and the results showed that the conditional CMT improves musical structural quality compared to its unconditional counterpart. Moreover, we used a neural classifier to predict the game genre of the generated pieces. Results showed that the CMT generator can learn correlations between gameplay videos and game genres, but further research is needed to achieve human-level performance.
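The clip-slicing and matching steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice to drop a trailing partial clip, and the hash-to-piece index are all assumptions made for illustration, and the matching step is a simplified overlap vote rather than a full Shazam-style algorithm (which also aligns hashes by time offset).

```python
from collections import Counter

CLIP_SECONDS = 15  # clip length stated in the abstract


def clip_windows(duration_s, clip_s=CLIP_SECONDS):
    """Return (start, end) offsets in seconds for consecutive full-length clips.

    Assumption: a trailing segment shorter than clip_s is dropped.
    """
    last_start = int(duration_s) - clip_s
    return [(s, s + clip_s) for s in range(0, last_start + 1, clip_s)]


def best_match(clip_hashes, index):
    """Pick the piece whose fingerprint hashes overlap the clip's the most.

    index is a hypothetical dict mapping a fingerprint hash to the set of
    piece identifiers that contain it. Returns None when nothing matches.
    """
    votes = Counter()
    for h in clip_hashes:
        for piece in index.get(h, ()):
            votes[piece] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, a 50-second video yields three clips starting at 0, 15, and 30 seconds, and each clip's hashes would then be passed to best_match against an index built from the synthesized NES-MDB pieces.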
