
Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

Leif Van Holland, Domenic Zingsheim, Mana Takhsha, Hannah Dröge, Patrick Stotko, Markus Plack, Reinhard Klein

TL;DR

This work fills in missing textures in rendered novel views with an application-targeted inpainting method that runs as an image-based post-processing step, independent of the underlying scene representation. Among the compared methods, the model achieves the best trade-off between quality and speed.

Abstract

High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often a consequence of real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple hole-filling heuristics, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures with a novel, application-targeted inpainting method that is applied as an image-based post-processing step after novel view rendering and is therefore independent of the underlying representation. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view aware, transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image-based and video-based metrics.

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Streamed content from a multi-camera setup (left) is prone to incomplete textures (center) because of missing information in the sparse viewpoints. To fix this, we propose a transformer-based inpainting method that efficiently incorporates information from the original images and thus surpasses traditional inpainting on the reconstruction alone (right).
  • Figure 2: Overview of the proposed transformer-based inpainting pipeline. The framework consists of two main stages: (1) Feature Encoding, where input and context images are encoded into feature representations and split into patches equipped with their spatio-temporal coordinates, and (2) Context Aggregation and Decoding, where a series of transformer groups uses the contextual information to update the inpaint patches, which are finally decoded and blended with the known image regions. The flame symbol indicates a module with trainable parameters.
  • Figure 3: Feature maps are split into overlapping patches: background-only patches are pruned, and object patches are kept as context $\mathcal{R}_t$ (blue). In the inpaint feature map, patches with missing pixels (green) constitute $\mathcal{P}_t$, while patches without missing pixels are added to the context. For clarity, the illustration shows non-overlapping patches, although the patches overlap in practice.
  • Figure 4: Visual comparison of our method against the pretrained multicam variants of the baseline methods. First column shows the input image from the reconstruction framework, last column shows the ground-truth view seen from the omitted camera.
  • Figure 5: Ablation study showing (from left to right): the input image, reconstruction using only a single camera view (once without and once with masks), reconstruction without leveraging past video frames, reconstruction without Rotary Positional Encodings (RoPE), and the full proposed pipeline.
  • ...and 2 more figures
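
The patch partition described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the patch size and stride, the use of boolean masks in place of feature maps, and the rule that a patch containing any missing pixel goes to the inpaint set are all assumptions made for illustration.

```python
def patch_origins(height, width, patch, stride):
    """Top-left corners of a regular patch grid (overlapping when stride < patch)."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


def any_in_patch(mask, r, c, patch):
    """True if the mask is set anywhere inside the patch with top-left corner (r, c)."""
    return any(mask[i][j] for i in range(r, r + patch) for j in range(c, c + patch))


def partition_patches(object_mask, missing_mask, patch=4, stride=2):
    """Split patches into an inpaint set (P_t) and a context set (R_t).

    Assumed rule: any missing pixel -> inpaint; otherwise any object
    pixel -> context; otherwise the patch is background-only and pruned.
    """
    height, width = len(object_mask), len(object_mask[0])
    inpaint, context = [], []
    for r, c in patch_origins(height, width, patch, stride):
        if any_in_patch(missing_mask, r, c, patch):
            inpaint.append((r, c))
        elif any_in_patch(object_mask, r, c, patch):
            context.append((r, c))
        # Background-only patches fall through and are dropped.
    return inpaint, context


# Hypothetical 8x8 map: the object covers columns 4..7, one pixel is missing.
object_mask = [[col >= 4 for col in range(8)] for _ in range(8)]
missing_mask = [[(row, col) == (1, 5) for col in range(8)] for row in range(8)]

inpaint, context = partition_patches(object_mask, missing_mask)
print("inpaint patches:", inpaint)
print("context patches:", context)
```

In this toy example, three of the nine candidate patches contain only background and are dropped, which is the source of the speed-up: the number of tokens passed to the transformer depends on the visible object and hole area instead of the full image resolution.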