Leveraging AI to Generate Audio for User-generated Content in Video Games

Thomas Marrinan; Pakeeza Akram; Oli Gurmessa; Anthony Shishkin

TL;DR

The paper addresses the challenge of adding dynamic, high-quality audio to user-generated content in video games by proposing AI-driven, on-the-fly audio generation through text-to-audio and image-to-audio pipelines. It uses two components of Meta's AudioCraft, MusicGen and AudioGen, to generate music and sound effects from descriptive prompts derived either from user-created content or from image captions, enabling real-time feedback. Two 2D prototype games demonstrate the approach: one generates background music for environments and the other generates sound effects for objects, with latency under 4 seconds for short audio clips. The study discusses ethical considerations and reports that, qualitatively, the generated audio aligns with game aesthetics. It outlines future directions, including human-in-the-loop editing and broader training data to better emulate established game audio, and highlights practical implications for creativity and engagement in user-generated content ecosystems.

Abstract

In video game design, audio (both environmental background music and object sound effects) plays a critical role. Sounds are typically pre-created assets designed for specific locations or objects in a game. However, user-generated content (e.g., building custom environments or crafting unique objects) is becoming increasingly popular in modern games. Since the possibilities are virtually limitless, it is impossible for game creators to pre-create audio for user-generated content. We explore the use of generative artificial intelligence to create music and sound effects on the fly based on user-generated content. We investigate two avenues for audio generation: 1) text-to-audio: using a text description of user-generated content as input to the audio generator, and 2) image-to-audio: using a rendering of the created environment or object as input to an image-to-text generator, then piping the resulting text description into the audio generator. In this paper, we discuss ethical implications of using generative artificial intelligence for user-generated content and highlight two prototype games where audio is generated for user-created environments and objects.
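
The two avenues described in the abstract can be read as a simple composition: image-to-audio is text-to-audio with a captioning step in front. The following is a minimal sketch of that structure only, not the authors' implementation. The names `Captioner`, `AudioGenerator`, `text_to_audio`, `image_to_audio`, `fake_captioner`, and `fake_generator` are hypothetical, and the stand-in models return placeholder data so the example runs without downloading any model.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper): any image-to-text
# model and any text-to-audio model could be plugged in behind these callables.
Captioner = Callable[[bytes], str]                    # image bytes -> text description
AudioGenerator = Callable[[str, float], List[float]]  # (prompt, seconds) -> audio samples


def text_to_audio(description: str, generate: AudioGenerator,
                  seconds: float = 4.0) -> List[float]:
    """Avenue 1: pass a text description straight to the audio generator."""
    return generate(description.strip(), seconds)


def image_to_audio(image: bytes, caption: Captioner, generate: AudioGenerator,
                   seconds: float = 4.0) -> List[float]:
    """Avenue 2: caption the rendering, then reuse the text-to-audio path."""
    return text_to_audio(caption(image), generate, seconds)


# Stand-in models so the sketch is self-contained. They produce placeholder
# output and do not perform any real captioning or audio synthesis.
def fake_captioner(image: bytes) -> str:
    return "an ominous cave with dark stone platforms"


def fake_generator(prompt: str, seconds: float, sample_rate: int = 8000) -> List[float]:
    # Returns silence of the requested length at an assumed sample rate.
    return [0.0] * int(seconds * sample_rate)


if __name__ == "__main__":
    clip = image_to_audio(b"", fake_captioner, fake_generator, seconds=2.0)
    print(len(clip), "samples")
```

In a real setting, `generate` would wrap a model such as MusicGen (for background music) or AudioGen (for sound effects), and `caption` would wrap an image-to-text model. Keeping both behind plain callables means the game code does not need to know which avenue produced the prompt.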

Paper Structure (7 sections, 3 figures)

This paper contains 7 sections, 3 figures.

Introduction
Ethical Considerations
Prototype Games
Game 1: User-Generated Environments
Game 2: User-Generated Objects
Discussion
Conclusion

Figures (3)

Figure 1: Three pre-created levels. In each level, the user controls a character to jump between platforms and reach the star. Each level has its own background music appropriate to its aesthetics (grasslands at twilight, tropical beach, and ominous cave).
Figure 2: Level editor. Users can select colors for making a background gradient, add a new platform, move an existing platform, change the color of an existing platform, and select the position of the goal.
Figure 3: Editing and testing a user-generated vehicle. Left panel shows the vehicle editor, where users can drag components around to create a custom vehicle. Right panel shows the completed vehicle attempting to cross the rough terrain.
