MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, Adarsh Kowdle

TL;DR

Empirical results indicate that MoXaRt significantly enhances speech intelligibility, yielding a 36.2% increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load, thereby paving the way for more perceptive and socially adept XR experiences.

Abstract

In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
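The control flow of the cascade described above can be sketched roughly as follows. This is only an illustration of the routing, not the authors' implementation: `coarse_separate`, `detect_sources`, `refine`, and `separate` are hypothetical placeholders standing in for the real networks, and capping the anchor list at five entries is an assumption based on the "up to 5 concurrent sources" figure.

```python
from concurrent.futures import ThreadPoolExecutor

# Every function here is a hypothetical stand-in, not one of MoXaRt's models.


def coarse_separate(audio):
    """Placeholder for audio-only coarse separation into two stems."""
    return {"speech": audio, "music": audio}


def detect_sources(frame):
    """Placeholder for visual detection of faces and instruments."""
    return frame["faces"], frame["instruments"]


def refine(stem, anchor):
    """Placeholder for a visually guided refinement network."""
    return f"{anchor}-track"


def separate(audio, frame, max_sources=5):
    # Coarse separation and visual detection run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        coarse_future = pool.submit(coarse_separate, audio)
        visual_future = pool.submit(detect_sources, frame)
        coarse = coarse_future.result()
        faces, instruments = visual_future.result()

    # Each detected source becomes a visual anchor for one refinement pass.
    anchors = [("speech", face) for face in faces]
    anchors += [("music", instrument) for instrument in instruments]
    anchors = anchors[:max_sources]  # assumed cap, see the note above

    return {name: refine(coarse[kind], name) for kind, name in anchors}


frame = {
    "faces": ["speaker_1", "speaker_2"],
    "instruments": ["violin", "cello", "piano"],
}
tracks = separate([0.0, 0.0, 0.0, 0.0], frame)
```

With two faces and three instruments in view, the sketch yields five entries, one per visual anchor, mirroring the "2 voices + 3 instruments" example.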

Paper Structure (64 sections, 4 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The user interface of MoXaRt and walkthrough of its internal implementation. (a) The user can identify separated sound sources through visualizations and adjust the volume of each source for remixing. (b) Video captured from the Quest 3 headset is streamed to the PC for model inference, and the remixed audio is streamed back. (c) To ensure continuous playback of remixed audio, MoXaRt operates in a three-stage pipeline: Video and Audio Capture, Model Inference, and Playback.
  • Figure 2: The cascaded architecture of MoXaRt for multi-modal sound separation. The system pipeline consists of two main stages: First, a Coarse Sound Separation module processes the raw audio inputs to produce initial coarse speech and music mixtures. Next, these mixtures, along with outputs from the Face Detection Network and Instrument Detection Network, are fed into two parallel refinement branches: Speech Refinement and Music Refinement. The Speech Refinement branch further disambiguates the coarse speech into individual speaker tracks and their corresponding spatial coordinates. The Music Refinement branch utilizes an ensemble of Band Split Roformer (BSR) models, dynamically activated by the detected instruments, to output isolated instrument tracks. This structured approach enables precise source separation by leveraging visual context only when specialized refinement is required.
  • Figure 3: Interactive Music Experiences. MoXaRt enables users to actively control their listening experience during live music events. (a) The user can act as a real-time audio engineer, selectively adjusting the gain of individual instruments (such as the violin, cello, and piano in a trio) to create a personalized mix. (b) In a concert setting, the user can enhance acoustic focus by isolating the musicians' performance from distracting audience chatter.
  • Figure 4: Enhanced Social Interaction. MoXaRt facilitates communication in complex and noisy social settings. (a) In an environment with multiple simultaneous conversations, a user can selectively amplify a target conversation to maintain focus. (b) The system can improve speech intelligibility in chaotic environments by isolating a conversation partner's voice from loud background noise, such as live music.
  • Figure 5: Downstream AI Assistance. The separated streams from MoXaRt can serve as clean, audio-visually aligned input for LLM assistants (e.g., Gemini, ChatGPT). (a) The system can isolate individual speakers in a multilingual environment, enabling real-time, multi-person transcription and translation. (b) The system's output could be used to create a queryable log of interactions, allowing a user to ask questions about past events (e.g., "What did Peter say about this evening?") and receive a contextually grounded response.
  • ...and 4 more figures
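
The per-source volume adjustment described in the Figure 1(a) and Figure 3 captions amounts to a gain-weighted sum of the separated tracks. A minimal sketch of that remixing step is shown below, assuming each track is a list of floating-point samples in [-1, 1]. The function name `remix`, the track labels, and the hard clipping at the end are illustrative assumptions, not details taken from the paper.

```python
def remix(tracks, gains, default_gain=1.0):
    """Sum separated tracks sample by sample after applying per-source gains.

    tracks: dict mapping a source name to a list of float samples.
    gains:  dict mapping a source name to a linear gain (0.0 mutes a source).
    """
    if not tracks:
        return []
    # Truncate to the shortest track so every index is valid for all sources.
    length = min(len(samples) for samples in tracks.values())
    mix = [0.0] * length
    for name, samples in tracks.items():
        gain = gains.get(name, default_gain)
        for i in range(length):
            mix[i] += gain * samples[i]
    # Hard-clip to [-1, 1]; a real system might use a limiter instead.
    return [max(-1.0, min(1.0, sample)) for sample in mix]


# Hypothetical separated tracks for a concert scene.
separated = {"violin": [0.5, 0.5], "chatter": [0.4, -0.4]}

# Mute the audience chatter and keep the violin at its original level.
focused = remix(separated, {"chatter": 0.0})
```

Setting a gain to 0.0 removes that source from the mix, while values above 1.0 amplify it, which corresponds to the "selectively amplify a target conversation" use case in Figure 4.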