Efficient 3D Reconstruction, Streaming and Visualization of Static and Dynamic Scene Parts for Multi-client Live-telepresence in Large-scale Environments

Leif Van Holland, Patrick Stotko, Stefan Krumpen, Reinhard Klein, Michael Weinmann

TL;DR

This paper tackles immersive 3D live-telepresence in large, dynamic environments using a single moving RGB-D camera. It introduces a hybrid volumetric representation that combines a voxel-based static scene (enriched with semantics and accumulated motion) with a dynamic-part point cloud, enabling separate yet synchronized streaming and VR visualization. The dynamic/static separation relies on frame-wise scores $S_k$ derived from instance segmentation, optical flow $F_k$, and odometry, with a robust accumulation mechanism $A_k$ and a dynamicity score $s'_k(i)$ guiding updates to the static model through SDF truncation and weight modulation. The proposed pipeline demonstrates near real-time VR exploration, robustness to network interruptions, and scalability to group-scale telepresence, with ablation studies validating key design choices and highlighting current limitations and future improvement directions.

Abstract

Despite the impressive progress of telepresence systems for room-scale scenes with static and dynamic scene entities, extending their capabilities to larger dynamic environments beyond a fixed size of a few square meters remains challenging. In this paper, we aim at sharing 3D live-telepresence experiences in large-scale environments beyond room scale, with both static and dynamic scene entities, at practical bandwidth requirements and based only on lightweight scene capture with a single moving consumer-grade RGB-D camera. To this end, we present a system built upon a novel hybrid volumetric scene representation. It combines a voxel-based representation for the static content, which stores not only the reconstructed surface geometry but also the object semantics and their accumulated dynamic movement over time, with a point-cloud-based representation for dynamic scene parts. The separation of dynamic from static parts is based on semantic and instance information extracted from the input frames. Static and dynamic content are streamed independently yet simultaneously, with potentially moving but currently static scene entities seamlessly integrated into the static model until they become dynamic again, and both are fused at the remote client. In this way, our system achieves VR-based live-telepresence at close to real-time rates. Our evaluation demonstrates the potential of our approach in terms of visual quality and performance, and includes ablation studies regarding the involved design choices.

Paper Structure (22 sections, 2 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Visualization of the key components of our proposed pipeline. The color image is blended with class and instance information, and shown along with the optical flow with respect to the previous frame (first image). This information is integrated to produce a mask that segments the frame into static and dynamic regions (second image). Together with an accumulated 3D motion estimate (third image), the scene is streamed to one or multiple remote clients for immersive exploration in VR (fourth image). In this example, the user chose to view the accumulated 3D motion.
  • Figure 2: Visualization of different processing stages for the $k$-th RGB-D frame in the pipeline. Starting with color $I_k$ and depth $D_k$, instance segmentation $L_k$ (class labels) and $\iota_k$ (instance IDs), optical flow $F_k$ and odometry flow $\Psi_k$ (i.e., the flow generated from the estimated camera motion) are computed. Next, the end-point-errors (EPE) between the flows are computed, normalized and propagated using the instance segmentation to generate the dynamicity scores $S_k$. The scores are accumulated in $A_k$ and $L_k, \iota_k, S_k$ and $A_k$ are used to integrate information about static regions in the voxel block model. New static voxels and current dynamic regions are sent to the server, which forwards this information to the exploration clients appropriately.
  • Figure 3: Results of our approach on different scenes. Left to right: Input color image; resulting segmentation into static (blue) and dynamic (yellow) regions; the accumulated 3D flow magnitude; a novel view of the scene as visualized in the exploration client.
  • Figure 4: Comparison of design choices of the proposed pipeline. Top row: An example output from the exploration client using the standard voxel block weighting scheme (left) vs. exponential weight decay via weight capping (right). The second approach yields a reconstruction of the box with fewer artifacts. Bottom row: Thresholding of the normalized EPE before (left) and after (right) propagation of the error modes into the static (blue) and dynamic (yellow) object masks. Again, the second approach produces a more plausible segmentation into static and dynamic regions.
  • Figure 5: Failure case of our method. Shown are RGB (top left), optical flow (top right), instance segmentation (bottom left) and resulting segmentation into static and dynamic (bottom right). Even though a clear motion cue is available in the optical flow image, due to a missing object detection, our method fails to correctly identify the dynamic region (orange circle).
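
Two mechanisms named in the captions of Figures 2 and 4 can be illustrated with a small NumPy sketch. It is not the authors' implementation. The first function computes the end-point error (EPE) between an optical flow $F_k$ and an odometry flow $\Psi_k$, normalizes it, and propagates it over instance masks to obtain per-pixel scores in the spirit of $S_k$. The normalization constant `epe_scale`, the use of the per-instance median, and the convention that instance ID 0 means "no instance" are all assumptions made for illustration. The second function shows the standard weighted running-average SDF update with an optional weight cap. Once the weight saturates at the cap, each update becomes an exponential moving average, so outdated observations decay exponentially. The parameter names are again hypothetical.

```python
import numpy as np


def dynamicity_scores(optical_flow, odometry_flow, instance_ids, epe_scale=5.0):
    """Per-pixel dynamicity scores from two flow fields of shape (H, W, 2).

    Assumptions (not from the paper): EPE is normalized by dividing by
    `epe_scale` and clipping to [0, 1]; each instance receives the median
    of its normalized EPE; instance ID 0 marks pixels without an instance.
    """
    # End-point error: Euclidean distance between the two flow vectors.
    epe = np.linalg.norm(optical_flow - odometry_flow, axis=-1)
    normalized = np.clip(epe / epe_scale, 0.0, 1.0)

    # Propagate the error to whole instances so that an object is labeled
    # consistently even where the flow is noisy.
    scores = normalized.copy()
    for inst in np.unique(instance_ids):
        if inst == 0:
            continue  # pixels without an instance keep their own value
        mask = instance_ids == inst
        scores[mask] = np.median(normalized[mask])
    return scores


def fuse_sdf(sdf, weight, sample, sample_weight=1.0, max_weight=None):
    """Weighted running-average SDF update with optional weight capping.

    With `max_weight` set, the stored weight saturates, and every further
    update blends in the new sample with a fixed ratio, which amounts to
    an exponential decay of older observations.
    """
    new_weight = weight + sample_weight
    fused = (weight * sdf + sample_weight * sample) / new_weight
    if max_weight is not None:
        new_weight = np.minimum(new_weight, max_weight)
    return fused, new_weight
```

Thresholding the returned scores (for example `scores > 0.5`, an arbitrary value) would give a static/dynamic mask like the one shown in Figure 3. For the fusion function, a small cap makes the static model adapt quickly when an object is moved, at the cost of less noise averaging, whereas the uncapped update keeps old geometry around for a long time.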