Mozualization: Crafting Music and Visual Representation with Multimodal AI
Wanfang Xu, Lixiang Zhao, Haiwen Song, Xinheng Song, Zhaolin Lu, Yu Liu, Min Chen, Eng Gee Lim, Lingyun Yu
TL;DR
Mozualization addresses the challenge of integrating multimodal emotional cues into music by combining text, image, and audio references into a cohesive composition. The approach has three components: multimodal data sonification (text and image processing via BLIP and MusicLM), targeted mixing based on MFCC features and beat tracking, and a stacked streamgraph that shows how each input influences the music. All three sit within a user-friendly interface that supports real-time previews. The paper contributes a design-informed architecture, the three-component pipeline, and a user study with nine music enthusiasts that indicates high usability, engagement, and meaningful interpretation of the multimodal outputs, while also identifying practical limitations and directions for refinement. The work advances accessible, personalized music creation and editing through multimodal interfaces, with potential applications in galleries, performances, and immersive environments where users can co-create and understand the link between inputs and sound.
Abstract
In this work, we introduce Mozualization, a music generation and editing tool that creates multi-style embedded music by integrating diverse inputs, such as keywords, images, and sound clips (e.g., segments from various pieces of music or even a playful cat's meow). Our work is inspired by the ways people express their emotions: writing mood-descriptive poems or articles, creating drawings with warm or cool tones, or listening to sad or uplifting music. Building on this concept, we developed a tool that transforms these emotional expressions into a cohesive and expressive song, allowing users to seamlessly incorporate their unique preferences and inspirations. To evaluate the tool and, more importantly, to gather insights for its improvement, we conducted a user study involving nine music enthusiasts. The study assessed user experience, engagement, and the impact of interacting with and listening to the generated music.
