MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos

Zheng Ning, Zheng Zhang, Jerrick Ban, Kaiwen Jiang, Ruohong Gan, Yapeng Tian, Toby Jia-Jun Li

TL;DR

MIMOSA is a human-AI co-creation system that enables amateur video creators to generate and manipulate spatial audio for videos with mono or stereo sound. It uses a multi-step audiovisual pipeline (object detection with depth estimation, sound separation with audio tagging, and real-time spatial rendering) coupled with an interactive UI featuring 2D/3D sound-source manipulation, a video panel, and an audio-properties panel. The system exposes interpretable intermediate outputs to support error discovery and repair, while also enabling creative augmentation beyond the model’s initial predictions. Technical evaluations and a user study together indicate that MIMOSA improves immersion and provides expressive control, offering a practical, composer-friendly path to accessible spatial-audio content creation and integration with editing tools like Premiere Pro.

Abstract

Spatial audio offers more immersive video consumption experiences to viewers; however, creating and editing spatial audio is often expensive and requires specialized equipment and skills, posing a high barrier for amateur video creators. We present MIMOSA, a human-AI co-creation tool that enables amateur users to computationally generate and manipulate spatial audio effects. For a video with only monaural or stereo audio, MIMOSA automatically grounds each sound source to the corresponding sounding object in the visual scene and enables users to validate and fix errors in the locations of sounding objects. Users can also augment the spatial audio effect by flexibly manipulating the sound source positions and creatively customizing the audio effect. The design of MIMOSA exemplifies a human-AI collaboration approach that, instead of utilizing state-of-the-art end-to-end "black-box" ML models, uses a multistep pipeline that aligns its interpretable intermediate results with the user's workflow. A lab user study with 15 participants demonstrates MIMOSA's usability, usefulness, expressiveness, and capability in creating immersive spatial audio effects in collaboration with users.
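
To make the idea of position-driven spatial rendering concrete, the sketch below maps one separated mono track to stereo from a horizontal position and a depth value. It is an illustration only and not MIMOSA's renderer: the constant-power panning law, the inverse-distance attenuation, and the names `stereo_gains`, `spatialize`, and `ref_depth` are all assumptions introduced here.

```python
import math


def stereo_gains(x: float, depth: float, ref_depth: float = 1.0) -> tuple[float, float]:
    """Return (left, right) gains for one sound source.

    x:      horizontal position, -1.0 (far left) to 1.0 (far right); hypothetical convention.
    depth:  estimated distance to the source, in the same units as ref_depth.
    """
    # Clamp the position so out-of-frame values do not wrap the pan angle.
    x = max(-1.0, min(1.0, x))
    # Constant-power pan: the angle sweeps from 0 (hard left) to pi/2 (hard right).
    theta = (x + 1.0) * math.pi / 4.0
    # Inverse-distance attenuation; sources closer than ref_depth are not boosted.
    atten = ref_depth / max(depth, ref_depth)
    return atten * math.cos(theta), atten * math.sin(theta)


def spatialize(mono: list[float], x: float, depth: float) -> list[tuple[float, float]]:
    """Apply the gains to every sample of a mono track, yielding stereo frames."""
    left, right = stereo_gains(x, depth)
    return [(sample * left, sample * right) for sample in mono]
```

Under these assumptions, dragging a source's dot to a new position in a 2D view would amount to calling `spatialize` again with the updated `x` and `depth`, which is why a per-source position is a convenient intermediate result for a user to inspect and correct.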

Paper Structure (32 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Mimosa's human-AI collaborative audio spatialization pipeline. Users can validate and adjust the intermediate results in three ways. From left to right, 1: users can adjust the audio properties of each separated soundtrack; 2: users can manually fix the error in aligning the separated soundtrack to the visual object in the video; 3: users can customize the spatial effect for each sounding object by manipulating its corresponding visual position.
  • Figure 2: (A) Users can view the volume of each channel using volume indicators. (B) Users can select the audio output format among monaural, stereo, quadraphonic, and 5.1 channels. (C) Users can toggle whether to create spatial effects based on the model-predicted spatial position or their own. (D) Video control buttons, from left to right: previous video, previous second, play/pause, next second, next video. (E) 2D sound source manipulation, where users can adjust the spatial position of each sounding object by moving the corresponding dot.
  • Figure 3: (A) 3D sound source manipulation, where users can adjust the spatial position of each sounding object by moving the corresponding sphere. (B) Users can change the viewing point position using the camera object. (C) Users can choose to move or rotate objects, especially the camera object.
  • Figure 4: (A) Users can specify the audio-visual correspondence by changing the name of the sounding object. (B) The object color indicator allows users to change the color of the dots (in 2D manipulation) or spheres (in 3D manipulation) representing sounding objects. (C) The numeric spatial position field enables users to view and modify the numeric coordinates of sounding objects. (D) By clicking the icon, users can control the volume through a volume slider. (E) A sound waveform display visualizes the current audio.
  • Figure 5: The setup of the user study.
  • ...and 1 more figure