SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

TL;DR

SphereDiffusion is a novel framework that addresses the unique challenges of spherical distortion and spherical geometry, enabling higher-quality and more precisely controllable generation of spherical panoramic images.

Abstract

Controllable spherical panoramic image generation holds substantial application potential across a variety of domains. However, it remains challenging because the inherent spherical distortion and geometry characteristics of panoramas lead to low-quality generated content. In this paper, we introduce SphereDiffusion, a novel framework that addresses these challenges to generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted objects with text encoding, then explicitly construct text-object correspondence to better exploit knowledge pre-trained on planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn spherical geometry. Furthermore, we enhance the denoising process of the diffusion model so that it can effectively use the learned geometric characteristic to ensure boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, reducing FID by around 35% on average in relative terms.
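The rotation invariance mentioned in the abstract can be illustrated with a minimal sketch. It assumes an equirectangular panorama and covers only rotation about the vertical axis, which in that projection reduces to a circular horizontal shift. The function names `rotate_yaw` and `augment_pair` are hypothetical and are not taken from the paper, whose training-time rotation may be more general.

```python
import numpy as np

def rotate_yaw(pano: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate an equirectangular array of shape (H, W, ...) about the vertical axis.

    Assumption: the full width W spans 360 degrees of longitude, so a yaw
    rotation is a circular shift of the columns.
    """
    width = pano.shape[1]
    shift = int(round(angle_deg / 360.0 * width)) % width
    return np.roll(pano, shift, axis=1)

def augment_pair(image: np.ndarray, seg_map: np.ndarray, angle_deg: float):
    """Apply the same rotation to an image and its segmentation map so they stay aligned."""
    return rotate_yaw(image, angle_deg), rotate_yaw(seg_map, angle_deg)

# Toy data standing in for a panorama and its segmentation condition.
rng = np.random.default_rng(0)
image = rng.random((4, 8, 3))
seg_map = rng.integers(0, 5, size=(4, 8))
rot_img, rot_seg = augment_pair(image, seg_map, 90.0)
```

Because the shift wraps around, the rotated pair depicts the same scene with the left and right image boundary moved to a different longitude, which is what makes such rotations usable as extra training views.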

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: The characteristics of spherical panoramic images and the impact of these characteristics on existing controllable generation methods.
  • Figure 2: Overview of SphereDiffusion. (Upper left) Distortion-Resilient Semantic Encoding introduces category information into the representation of segmentation maps to alleviate text-image mismatch. (Upper right) Spherical SimSiam Contrastive Learning is the part of Spherical Geometry-aware (SGA) Training that constructs contrastive learning in the latent space, equipping SphereDiffusion with spherical geometry at the objective-function level. (Lower left) Spherical Reprojection is the part of SGA Training that operates at the data level, and Spherical Rotation serves as the foundation for SGA Training. (Lower middle) DDaB with deformable convolution enhances the model's ability to perceive spherical distortion.
  • Figure 3: The process of Spherical Geometry-aware Generation. During generation, we uniformly select $K$ steps at which the image is rotated by an angle $\alpha^{r}$ to enhance the boundary connectivity of the generated image.
  • Figure 4: Visual comparison of SphereDiffusion with ControlNet. The images generated by SphereDiffusion are more closely aligned with the guidance provided by the segmentation maps and text prompts (highlighted by red line boxes and green dotted boxes). 'Overview' is the generated image, and 'Boundary' displays the boundary of the generated image.
  • Figure 5: Generated images with and without Spherical Geometry-aware Generation. We use the same SphereDiffusion model with identical text prompts, segmentation maps, and random seeds. The first row shows images generated without SGA Generation, while the second row shows images generated with SGA Generation. 'Rotated Image' is obtained by rotating the generated 'Overview Image' by $\alpha = 180^{\circ}$.
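The rotation schedule described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the rotation is taken to be about the vertical axis (a circular shift along the width of the latent), the condition is rotated together with the latent, the accumulated rotation is undone at the end, and `toy_step` is a hypothetical pointwise stand-in for a real denoising step. All function names are assumptions.

```python
import numpy as np

def select_rotation_steps(num_steps: int, k: int) -> list[int]:
    """Pick k step indices spread uniformly over range(num_steps)."""
    return [(i * num_steps) // k for i in range(k)]

def generate_with_rotation(latent, cond, denoise_step, num_steps, k, angle_deg):
    """Run a denoising loop, rotating latent and condition at k uniformly chosen steps."""
    width = latent.shape[-1]
    shift = int(round(angle_deg / 360.0 * width))
    rotate_at = set(select_rotation_steps(num_steps, k))
    total_shift = 0
    for t in range(num_steps):
        if t in rotate_at:
            # Circular shift moves the left/right seam into the image interior,
            # so later steps denoise across the former boundary.
            latent = np.roll(latent, shift, axis=-1)
            cond = np.roll(cond, shift, axis=-1)
            total_shift += shift
        latent = denoise_step(latent, cond, t)
    # Assumption: undo the accumulated rotation so the output matches the input layout.
    return np.roll(latent, -total_shift, axis=-1)

def toy_step(latent, cond, t):
    """Hypothetical pointwise 'denoiser' used only to exercise the loop."""
    return 0.9 * latent + 0.1 * cond

rng = np.random.default_rng(0)
latent0 = rng.standard_normal((4, 4, 8))
cond0 = rng.standard_normal((4, 4, 8))
with_rotation = generate_with_rotation(latent0, cond0, toy_step, 20, 4, 90.0)
without_rotation = generate_with_rotation(latent0, cond0, toy_step, 20, 0, 90.0)
```

With the pointwise `toy_step`, the rotated and unrotated runs agree, which only checks that the bookkeeping is consistent. A real denoiser has a spatial receptive field, so rotating mid-generation changes which pixels it treats as neighbours across the seam.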