PanoDiffusion: 360-degree Panorama Outpainting via Diffusion

Tianhao Wu, Chuanxia Zheng, Tat-Jen Cham

TL;DR

Outpainting a 360° panorama from a narrow-field RGB image is challenging because most of the scene is missing and the left and right edges must join seamlessly. The authors introduce PanoDiffusion, a bi-modal latent diffusion model trained on RGB-D panoramas; it learns depth-aware structure and completes RGB-D panoramas even when no depth is provided at inference. Two wraparound mechanisms, camera-rotation data augmentation during training and a two-end alignment strategy during denoising, enforce a seamless 360° result. A two-stage pipeline with RefineNet upscaling produces 512×1024 panoramas, and experiments on Structured3D show state-of-the-art RGB-D outpainting across diverse mask types along with robust depth synthesis. The approach supports 3D indoor scene reconstruction by producing semantically rich, spatially coherent panoramas without requiring depth input at test time.

Abstract

Generating complete 360-degree panoramas from narrow field of view images is ongoing research as omnidirectional RGB data is not readily available. Existing GAN-based approaches face some barriers to achieving higher quality output, and have poor generalization performance over different mask types. In this paper, we present our 360-degree indoor RGB-D panorama outpainting model using latent diffusion models (LDM), called PanoDiffusion. We introduce a new bi-modal latent diffusion structure that utilizes both RGB and depth panoramic data during training, which works surprisingly well to outpaint depth-free RGB images during inference. We further propose a novel technique of introducing progressive camera rotations during each diffusion denoising step, which leads to substantial improvement in achieving panorama wraparound consistency. Results show that our PanoDiffusion not only significantly outperforms state-of-the-art methods on RGB-D panorama outpainting by producing diverse well-structured results for different types of masks, but can also synthesize high-quality depth panoramas to provide realistic 3D indoor models.
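
The camera rotations mentioned in the abstract can be illustrated with a small sketch. It assumes the panoramas are stored in equirectangular format, where rotating the camera about the vertical axis is the same as circularly shifting the image columns. The function names `rotate_panorama` and `augment_rgbd` are hypothetical and are not taken from the paper; the only point shown is that the RGB and depth panoramas must receive the same shift so they stay pixel-aligned.

```python
import numpy as np


def rotate_panorama(pano: np.ndarray, angle_deg: float) -> np.ndarray:
    """Yaw-rotate an equirectangular panorama of shape (H, W) or (H, W, C).

    Assumption: a rotation about the vertical axis is a circular shift
    along the width axis, with 360 degrees spanning the full width.
    """
    width = pano.shape[1]
    shift = int(round(angle_deg / 360.0 * width)) % width
    return np.roll(pano, shift, axis=1)


def augment_rgbd(rgb: np.ndarray, depth: np.ndarray, rng: np.random.Generator):
    """Apply one random rotation angle to both modalities (hypothetical helper)."""
    angle = rng.uniform(0.0, 360.0)
    return rotate_panorama(rgb, angle), rotate_panorama(depth, angle), angle


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.random((4, 8, 3))      # toy 4x8 RGB panorama
    depth = rgb[..., 0].copy()       # toy depth map aligned with the red channel
    rgb_aug, depth_aug, angle = augment_rgbd(rgb, depth, rng)
    # The two modalities remain aligned after the shared rotation.
    print(np.array_equal(depth_aug, rgb_aug[..., 0]))
```

Because the shift wraps around, every rotated sample is still a valid panorama of the same scene, which is what makes this kind of augmentation cheap.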

Paper Structure (25 sections, 4 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 1: Example results of 360° Panorama Outpainting on various masks. Compared to BIPS (Oh et al., 2022) and OmniDreamer (Akimoto et al., 2022), our model not only effectively generates semantically meaningful content and plausible appearances with many objects, such as beds, sofas and TVs, but also provides multiple and diverse solutions for this ill-posed problem. (Masked regions are shown in blue for better visualization. Zoom in to see the details.)
  • Figure 2: The overall pipeline of our proposed PanoDiffusion method. (a) During training, the model is optimized for RGB-D panorama synthesis, without the mask. (b) During inference, however, the depth information is no longer needed for masked panorama outpainting. (c) Finally, a super-resolution model is implemented to further enhance the high-resolution outpainting. We only show the input/output of each stage and omit the details of circular shift and adding noise. Note that the VQ-based encoder-decoders are pre-trained in advance, and fixed in the rest of our framework.
  • Figure 3: Our LDM outpainting structure with camera rotation mechanism. During training (a), we randomly select a rotation angle to generate a new panorama for data augmentation. During inference (b), we sample the visible region from the encoded features (above) and the invisible part from the denoising output (below). The depth map is not needed, and is set to random noise. At each denoising step, we crop a 90°-equivalent area of the intermediate result from the right and stitch it to the left, denoted by the circle following $z_{t}^{mixed}$; this strongly improves wraparound consistency.
  • Figure 4: An example of our two-end alignment mechanism. During inference, we rotate the scene by 90° in each denoising step. Over the total of 200 sampling steps, PanoDiffusion effectively achieves wraparound consistency.
  • Figure 5: Examples of various mask types. See text for details.
  • ...and 3 more figures
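
The inference procedure described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a toy illustration under several assumptions, not the authors' implementation: the denoiser is replaced by a hypothetical placeholder, the re-noising of the visible region at each step is omitted, the latent width is assumed to be divisible by 4, and all function names are invented for this example. It shows only the two operations the captions name: mixing visible and generated latents with a mask, then moving the rightmost 90° of the result to the left.

```python
import numpy as np


def rotate_90(x: np.ndarray, steps: int = 1) -> np.ndarray:
    """Move the rightmost quarter of the width axis to the left, `steps` times."""
    width = x.shape[-1]
    if width % 4 != 0:
        raise ValueError("width must be divisible by 4 for an exact 90-degree shift")
    return np.roll(x, steps * (width // 4), axis=-1)


def mix(z_visible: np.ndarray, z_denoised: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep visible latents where mask == 1 and denoised latents elsewhere."""
    return mask * z_visible + (1.0 - mask) * z_denoised


def toy_sampling_loop(z_visible, mask, num_steps=200, seed=0):
    """Run the mix-then-rotate loop with a placeholder in place of the denoiser."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(z_visible.shape)
    for _ in range(num_steps):
        z_denoised = 0.5 * z  # hypothetical stand-in for the real denoising network
        z = rotate_90(mix(z_visible, z_denoised, mask))
        # Rotate the conditioning too, so it stays aligned with the rotated latent.
        z_visible = rotate_90(z_visible)
        mask = rotate_90(mask)
    return z, z_visible, mask


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    z_visible = rng.standard_normal((4, 8, 16))  # toy latent: (channels, H, W)
    mask = np.zeros((1, 8, 16))
    mask[..., 6:10] = 1.0                        # narrow visible strip in the middle
    z, z_vis_end, mask_end = toy_sampling_loop(z_visible, mask)
    # 200 steps of 90 degrees are 50 full turns, so the orientation is restored.
    print(np.array_equal(mask_end, mask))
```

Since the seam between the left and right edges lands in the interior of the latent after each shift, the denoiser sees both ends next to each other during sampling. With 200 steps of exactly 90° each, the total rotation is a whole number of turns, so no final realignment is needed in this sketch.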