Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information

Christian Marinoni, Riccardo Fosco Gramaccioni, Eleonora Grassucci, Danilo Comminiello

TL;DR

Con360-AV addresses controllable audio-visual generation from immersive 360° environments by conditioning diffusion-based generators on three spatial cues: panoramic saliency maps, BASD maps that define the target viewpoint, and a global 360° scene caption. The method combines parallel audio and video diffusion networks with a dedicated Map Encoder and FiLM-based conditioning, so that viewpoint-specific outputs remain coherent with off-screen events in the full 360° context, and uses CMC-PE for temporal synchronization. Evaluations on Sphere360 show better spatial controllability and audio-visual coherence than a baseline, supporting the approach for immersive media applications. The work enables realistic off-screen sound propagation and viewpoint-aware visuals, with potential extensions to multichannel spatial audio such as Ambisonics for fully immersive experiences.
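
As an illustration of the FiLM-based conditioning mentioned above, the following is a minimal NumPy sketch of feature-wise linear modulation in general: a conditioning vector is projected to a per-channel scale and shift that are applied to a feature map. It is not the Con360-AV implementation; the function name `film`, the single linear projections, and the tensor shapes are assumptions made for this example.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation (FiLM) to a feature map.

    features: (batch, channels, height, width) activations to modulate.
    cond:     (batch, cond_dim) conditioning vectors (hypothetical: e.g. the
              output of some map encoder).
    w_gamma, w_beta: (cond_dim, channels) projection weights (assumed linear).
    b_gamma, b_beta: (channels,) projection biases.
    """
    # Project the conditioning vector to one scale and one shift per channel.
    gamma = cond @ w_gamma + b_gamma  # (batch, channels)
    beta = cond @ w_beta + b_beta     # (batch, channels)
    # Broadcast over the spatial dimensions: out = gamma * x + beta.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]
```

With zero projection weights, a scale bias of one, and a shift bias of zero, the layer reduces to the identity, which is a common way to initialize such conditioning so that it starts out neutral.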

Abstract

The generation of sounding videos has advanced significantly with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation that addresses this unexplored gap. Specifically, we propose a diffusion model conditioned on a set of signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, providing the strong controllability that is essential for realistic and immersive audio-visual generation. We show audio-visual examples demonstrating the effectiveness of our framework.

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, and 1 table.

Figures (3)

  • Figure 1: The Con360-AV generation process. The model generates a target viewpoint conditioned on three inputs: global 360° saliency maps, a scene-wide text caption, and viewpoint-specific BASD maps. The figure shows how two different outputs, Viewpoint 1 (green) and Viewpoint 2 (red), are generated from the same scene by providing their corresponding BASD maps.
  • Figure 2: Derivation of the three contextual conditionings from a 360° video. The pipeline generates a textual prompt and two visual prompts through parallel branches. (1) Spatially-aware Caption: For each of the six directional viewpoints (front, back, left, right, top, bottom), BLIP-2 generates a series of captions at different timestamps to capture the evolution of the scene. Llama-3.2 then synthesizes this full set of descriptions into a single, dynamic caption summarizing the action across the entire 360° view over time. (2) Saliency Maps: In parallel, the full 360° video is processed by SalViT360 to generate a sequence of saliency maps that highlights the most visually important regions over time. (3) BASD Maps: Centroids from the saliency maps are used to define the target viewpoint and generate a corresponding BASD (Bounding-box-Aware Signed Distance) map for structural guidance.
  • Figure 3: Example of two generated viewpoints of the same scene.
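
The BASD branch described in Figure 2 (a saliency centroid defines the viewpoint, which is then encoded as a signed distance map) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the function names `saliency_centroid` and `box_signed_distance` are hypothetical, the viewpoint is treated as an axis-aligned box in pixel space, the sign convention (negative inside, positive outside) is assumed, and the horizontal wrap-around of equirectangular panoramas is ignored.

```python
import numpy as np

def saliency_centroid(saliency):
    """Return the saliency-weighted centroid (row, col) of a 2D map."""
    h, w = saliency.shape
    total = saliency.sum()
    if total <= 0:
        # No salient mass: fall back to the image center.
        return (h - 1) / 2.0, (w - 1) / 2.0
    cy = (saliency.sum(axis=1) * np.arange(h)).sum() / total
    cx = (saliency.sum(axis=0) * np.arange(w)).sum() / total
    return float(cy), float(cx)

def box_signed_distance(height, width, center, box_size):
    """Signed distance from every pixel to an axis-aligned box.

    center:   (row, col) of the box center, e.g. a saliency centroid.
    box_size: (box_height, box_width) in pixels (assumed viewpoint extent).
    Values are negative inside the box, zero on its border, positive outside.
    """
    cy, cx = center
    half_h, half_w = box_size[0] / 2.0, box_size[1] / 2.0
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    # Per-axis distance to the box faces (negative when inside along that axis).
    dy = np.abs(ys - cy) - half_h
    dx = np.abs(xs - cx) - half_w
    # Outside: Euclidean distance to the nearest border point.
    outside = np.hypot(np.maximum(dx, 0.0), np.maximum(dy, 0.0))
    # Inside: negative distance to the nearest face.
    inside = np.minimum(np.maximum(dx, dy), 0.0)
    return outside + inside
```

For example, `box_signed_distance(9, 9, saliency_centroid(s), (4, 4))` would produce a 9×9 map whose most negative value sits at the centroid of `s`. A faithful treatment of 360° content would also need to handle longitude wrap-around and the spherical geometry of the viewpoint.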