Building Audio-Visual Digital Twins with Smartphones

Zitong Lan, Yiwei Tang, Yuhan Wang, Haowen Lai, Yiduo Hao, Mingmin Zhao

TL;DR

AV-Twin addresses the lack of acoustics in digital twins by enabling practical, editable audio-visual replicas built with commodity smartphones. It fuses smartphone-based, dynamic room impulse response (RIR) capture with a visual ground and a visual-assisted acoustic field model, and it learns per-surface material properties via differentiable rendering to support scene edits. The system delivers measurable gains in data efficiency, rendering speed, and material-estimation accuracy, enabling immersive audio rendering, perceptual evaluation of edits, and improved acoustic localization. This work demonstrates a feasible path to fully modifiable AV digital twins for real-world environments on everyday devices.

Abstract

Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.

Paper Structure

This paper contains 20 sections, 5 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: Users can easily reconstruct an audio-visual digital twin using only a pair of smartphones. AV-Twin further extends this to a modifiable audio-visual scene by estimating the material properties of each mesh and enabling both material and geometry edits.
  • Figure 2: AV-Twin builds an audio-visual digital twin (§\ref{sec:audio_visual_digital_twin}) with a series of key innovations. It also enables a modifiable audio-visual scene by estimating acoustic properties (§\ref{sec:acoustic_property_capture}) to support various editing capabilities (§\ref{sec:audio_visual_scene_editing}). Together, these enable the practical applications demonstrated in the experiments.
  • Figure 3: Illustration of RIR extraction. (a) The received signal is cross-correlated with the reference chirp. Because (b) the chirp's autocorrelation produces a delta-like peak, (c) the resulting output reveals the RIR, which consists of the direct path, multipath reflections, and late reverberation.
  • Figure 4: Illustration of the acoustic two-way handshake design, which simultaneously records the RIR and determines the correct time of flight (ToF).
  • Figure 5: (a) Real-time chirp detection: the Tx recording $x_\text{tx}$ is converted to baseband and down-sampled to speed up computation, then correlated with $c_1$ in the time-frequency domain for detection. (b) Chirp arrival time detection: correlating the recording with the known $c_2$ in the time domain reveals the RIR, from which we identify the LOS peak (right).
  • ...and 17 more figures
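
The RIR extraction described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`linear_chirp`, `estimate_rir`), the sampling rate, the chirp band, and the synthetic two-tap RIR are all assumptions chosen for illustration, and the sketch omits the baseband conversion, down-sampling, and two-way handshake mentioned in the Figure 4 and Figure 5 captions. It only shows why cross-correlating a recording with the reference chirp approximately recovers the impulse response when the chirp's autocorrelation is delta-like.

```python
import numpy as np


def linear_chirp(f0, f1, duration, fs):
    """Real-valued linear chirp sweeping from f0 to f1 Hz (illustrative parameters)."""
    t = np.arange(int(duration * fs)) / fs
    sweep_rate = (f1 - f0) / duration
    return np.sin(2 * np.pi * (f0 * t + 0.5 * sweep_rate * t ** 2))


def estimate_rir(received, chirp, rir_len):
    """Estimate an RIR by cross-correlating the recording with the reference chirp.

    Because the chirp's autocorrelation is close to a delta, the correlation
    output approximates the impulse response the chirp passed through.
    """
    full = np.correlate(received, chirp, mode="full")
    zero_lag = len(chirp) - 1           # index of lag 0 in the "full" output
    est = full[zero_lag:zero_lag + rir_len]
    return est / np.dot(chirp, chirp)   # normalize by the chirp energy


# Hypothetical setup: 48 kHz audio, 50 ms chirp from 1 kHz to 20 kHz.
fs = 48_000
chirp = linear_chirp(1_000.0, 20_000.0, 0.05, fs)

# Synthetic RIR (an assumption): direct path at sample 100, one reflection at 340.
true_rir = np.zeros(1_000)
true_rir[100] = 1.0
true_rir[340] = 0.5

received = np.convolve(chirp, true_rir)            # what the microphone would record
rir_est = estimate_rir(received, chirp, len(true_rir))

direct_path_index = int(np.argmax(np.abs(rir_est)))  # strongest peak = direct path
direct_path_delay_s = direct_path_index / fs
```

In this toy setup the strongest peak of `rir_est` marks the direct path and the weaker peak marks the reflection. A real recording would add noise, clock offset between the two phones, and late reverberation, none of which this sketch models.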