Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

Shuwei He, Rui Liu

TL;DR

The paper tackles immersive Visual Text-to-Speech (VTTS) by drawing on multi-source spatial knowledge beyond RGB to generate reverberant speech that matches the environmental context. It proposes MS$^2$KU-VTTS, a four-component framework that fuses RGB, depth, speaker position, and Gemini-generated spatial captions through a Dominant-Supplement Serial Interaction and an entropy-based Dynamic Fusion, and uses the fused representation to guide ViT-TTS-based speech synthesis. Experiments on the SoundSpaces-Speech dataset show significant improvements over state-of-the-art baselines in both perceptual quality and reverberation fidelity, especially in unseen environments. The work indicates that integrating diverse spatial cues through carefully designed interactions yields more natural, environment-consistent speech suitable for AR/VR settings. Code and demos are publicly available.

Abstract

Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge such as depth, speaker position, and environmental semantics. To address this issue, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS. Specifically, we first prioritize the RGB image as the dominant source and consider the depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. Afterwards, we propose a serial interaction mechanism to effectively integrate both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on the respective contributions of each source. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive speech experience. Experimental results demonstrate that MS$^2$KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/AI-S2-Lab/MS2KU-VTTS.
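
The abstract states that the multi-source knowledge is integrated "based on the respective contributions of each source", and the TL;DR calls this step entropy-based. The exact formulation is not given here, so the following is only a minimal sketch of one plausible reading: each source's feature vector is turned into a distribution, its Shannon entropy is computed, and sources with lower entropy (more peaked, hence assumed more informative) receive larger fusion weights. The function names, the source keys, and the softmax-over-negative-entropy weighting are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weights(features):
    """Map each source name to a fusion weight.

    Assumption: lower entropy means a larger contribution, so the
    weights are a softmax over the negated per-source entropies.
    """
    names = list(features)
    entropies = [entropy(softmax(features[n])) for n in names]
    weights = softmax([-h for h in entropies])
    return dict(zip(names, weights))


def fuse(features):
    """Weighted sum of equally sized source feature vectors."""
    weights = entropy_weights(features)
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for name, vec in features.items():
        for i, value in enumerate(vec):
            fused[i] += weights[name] * value
    return fused, weights


# Hypothetical toy features, one vector per RGB-plus-supplement interaction.
toy = {
    "depth": [4.0, 0.0, 0.0, 0.0],     # sharply peaked, low entropy
    "position": [1.0, 1.0, 1.0, 1.0],  # flat, maximum entropy
    "caption": [2.0, 1.0, 0.0, 0.0],   # in between
}
fused_vector, source_weights = fuse(toy)
```

In this toy setup the flat "position" vector gets the smallest weight and the peaked "depth" vector the largest. A real system would operate on learned feature tensors rather than hand-written lists.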

Paper Structure

This paper contains 22 sections, 9 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The overall architecture of MS$^2$KU-VTTS.