Leveraging AI to Generate Audio for User-generated Content in Video Games
Thomas Marrinan, Pakeeza Akram, Oli Gurmessa, Anthony Shishkin
TL;DR
The paper tackles the challenge of adding dynamic, high-quality audio to user-generated content in video games by proposing AI-driven, on-the-fly audio generation using text-to-audio and image-to-audio pipelines. It uses two components of Meta's AudioCraft, MusicGen and AudioGen, to generate music and sound effects from descriptive prompts that are either derived from user-created content or produced by captioning an image, enabling real-time feedback. Two 2D prototype games demonstrate the approach: one generates background music for environments and the other generates sound effects for objects, with latency under 4 seconds for short audio clips. The study discusses ethical considerations and reports, on a qualitative basis, that the generated audio fits the games' aesthetics. It also outlines future directions, including human-in-the-loop editing and broader training data to better emulate established game audio, and highlights practical implications for creativity and engagement in user-generated content ecosystems.
Abstract
In video game design, audio (both environmental background music and object sound effects) plays a critical role. Sounds are typically pre-created assets designed for specific locations or objects in a game. However, user-generated content (e.g., building custom environments or crafting unique objects) is becoming increasingly popular in modern games. Since the possibilities are virtually limitless, it is impossible for game creators to pre-create audio for user-generated content. We explore the use of generative artificial intelligence to create music and sound effects on the fly based on user-generated content. We investigate two avenues for audio generation: (1) text-to-audio, which uses a text description of user-generated content as input to the audio generator, and (2) image-to-audio, which uses a rendering of the created environment or object as input to an image-to-text generator and then pipes the resulting text description into the audio generator. In this paper, we discuss the ethical implications of using generative artificial intelligence for user-generated content and highlight two prototype games in which audio is generated for user-created environments and objects.
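The two avenues described in the abstract differ only in how the text prompt is obtained, which the following minimal Python sketch tries to make concrete. It is an illustration under stated assumptions, not the authors' implementation: the names `Captioner`, `AudioGenerator`, `text_to_audio`, `image_to_audio`, `fake_captioner`, `fake_generator`, and `SAMPLE_RATE` are all hypothetical, and the two `fake_*` functions are stand-ins that return placeholder data so the sketch runs without any model. In practice the generator callable would wrap a real text-to-audio model and the captioner a real image-to-text model.

```python
from typing import Callable, List

# Hypothetical interfaces; the paper does not specify these signatures.
Captioner = Callable[[bytes], str]                    # image bytes -> text description
AudioGenerator = Callable[[str, float], List[float]]  # (prompt, seconds) -> audio samples


def text_to_audio(description: str, generate: AudioGenerator,
                  seconds: float = 4.0) -> List[float]:
    """Avenue 1: pass a text description straight to the audio generator."""
    return generate(description.strip(), seconds)


def image_to_audio(image: bytes, caption: Captioner, generate: AudioGenerator,
                   seconds: float = 4.0) -> List[float]:
    """Avenue 2: caption the rendering, then reuse the text-to-audio path."""
    return text_to_audio(caption(image), generate, seconds)


# Stand-ins so the sketch runs without any model; these are NOT real models.
SAMPLE_RATE = 16000  # assumed sample rate, chosen only for illustration


def fake_captioner(image: bytes) -> str:
    # A real captioner would inspect the image; this returns a fixed caption.
    return "a snowy forest at night"


def fake_generator(prompt: str, seconds: float) -> List[float]:
    # A real generator would synthesize audio; this returns silence.
    return [0.0] * int(SAMPLE_RATE * seconds)


# Placeholder bytes stand in for a rendered screenshot of the user's creation.
clip = image_to_audio(b"placeholder-image", fake_captioner, fake_generator, seconds=1.0)
```

Passing the models in as callables keeps the image-to-audio path a thin wrapper around the text-to-audio path, mirroring the "piping" described above, and lets either model be swapped without touching the game-side code.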
