
Panorama Generation From NFoV Image Done Right

Dian Zheng, Cheng Zhang, Xiao-Ming Wu, Cao Li, Chengfei Lv, Jian-Fang Hu, Wei-Shi Zheng

TL;DR

This work tackles the problem of generating accurate 360-degree panoramas from NFoV images by exposing the distortion-evaluation gap in existing metrics and introducing Distort-CLIP, a distortion-aware evaluation model. Building on this, it proposes PanoDecouple, a decoupled diffusion framework with DistortNet for distortion guidance and ContentNet for content completion, linked through a disturbance-aware distortion map and an all-block condition-registration mechanism. A distortion correction loss leveraging Distort-CLIP further enforces distortion fidelity, enabling zero-shot generalization and strong performance with only 3K training samples. The approach achieves state-of-the-art image quality and distortion accuracy on SUN360 and Laval Indoor benchmarks and extends to practical tasks like text editing and text-to-panorama generation, indicating broad applicability in VR and 3D scene synthesis.

Abstract

Generating 360-degree panoramas from a narrow field of view (NFoV) image is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which tend to capture image quality and are not suitable for evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and we discover the "visual cheating" phenomenon in previous works (i.e., a tendency to improve visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn the distinct tasks of panorama distortion and content completion at once, which leads the model to prioritize optimizing the latter. To address this phenomenon, we propose PanoDecouple, a decoupled diffusion model framework that splits panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance by imposing a panorama-specific distortion prior and a modified condition registration mechanism, and a ContentNet for content completion by imposing perspective image information. Additionally, a distortion correction loss function based on Distort-CLIP is introduced to constrain the distortion explicitly. Extensive experiments validate that PanoDecouple surpasses existing methods in both distortion and visual metrics.
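
To make the idea of a CLIP-style distortion evaluator more concrete, the following is a minimal sketch of how image features could be matched to distortion-type text features by cosine similarity, with a target of 1 for matching types and 0 otherwise (as the caption of Figure 2 describes). Everything here is an assumption for illustration: the function names, the 8-dimensional placeholder features, and the three unnamed distortion types indexed 0 to 2 are hypothetical and are not taken from the Distort-CLIP implementation.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n x d) and rows of b (m x d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def predict_distortion_type(image_feats, text_feats):
    """Assign each image the index of the most similar distortion-type text feature."""
    return cosine_similarity_matrix(image_feats, text_feats).argmax(axis=1)

# Placeholder features (hypothetical): one text prototype per distortion type,
# and one noisy image feature close to each prototype.
rng = np.random.default_rng(0)
text_feats = np.eye(3, 8)
image_feats = text_feats + 0.05 * rng.standard_normal((3, 8))

# Image-to-text similarities, and the 1/0 target pattern for matching types.
sim = cosine_similarity_matrix(image_feats, text_feats)
target = np.eye(3)
predicted = predict_distortion_type(image_feats, text_feats)
```

In a real setting the placeholder arrays would be replaced by features from trained image and text encoders; the sketch only shows the similarity computation and the shape of the target matrix.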

Paper Structure

This paper contains 22 sections, 16 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Image quality and distortion accuracy of existing methods and ours, measured by FID and Distort-FID (ours) respectively. We project two regions of each panorama (marked in the corresponding color) into perspective images to show the distortion accuracy of existing methods (i.e., a perspective image with no distortion and a natural layout indicates a good result). Recent methods improve image quality while significantly degrading distortion accuracy. We name this the "visual cheating" phenomenon. Zoom in for best view.
  • Figure 2: The training pipeline of our Distort-CLIP. Cosine similarity is computed between the image features of the three distortion types and themselves, and between those image features and the text features of the three distortion types. "-" means that the corresponding elements do not participate in the computation because the comparison is meaningless. Blue boxes mean that the similarity of the corresponding elements is 1; for all other boxes it is 0. Zoom in for best view.
  • Figure 3: The pipeline of the proposed PanoDecouple, a decoupled diffusion model. The DistortNet focuses on distortion guidance via the proposed distortion map. To make full use of the position-encoding-like distortion map, we extend the condition registration mechanism of ControlNet from the first block only to all blocks. The ContentNet is devoted to content completion by imposing the partial panorama image input and perspective information. The U-Net remains frozen, coordinating the information fusion between the content completion and distortion guidance branches while fully leveraging its powerful pre-trained knowledge. Note that we omit the text input of the DistortNet and U-Net for simplicity, while the text input of the ContentNet is replaced by a perspective image embedding.
  • Figure 4: Qualitative comparison of panorama generation from NFoV image. We sequentially present the results on SUN360, Laval Indoor, and raw images (two images each). Zoom in for best view.
  • Figure 5: The quantitative results of text-panorama generation. Zoom in for best view.
  • ...and 6 more figures
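
The caption of Figure 1 relies on projecting a panorama region into a perspective image to inspect distortion. The paper's own projection code is not shown here, so the following is only a minimal sketch of the standard pinhole (gnomonic) sampling from an equirectangular panorama. The function name, its parameters, the axis conventions (x right, y down, z forward; positive yaw turns right, positive pitch looks up), and the nearest-neighbour sampling are all assumptions made for illustration.

```python
import numpy as np

def perspective_from_equirect(pano, fov_deg, yaw_deg, pitch_deg, out_hw):
    """Sample a perspective view from an equirectangular panorama (H x W x C).

    Hypothetical helper: nearest-neighbour sampling keeps the sketch short.
    """
    pano_h, pano_w = pano.shape[:2]
    out_h, out_w = out_hw
    # Focal length in pixels, derived from the horizontal field of view.
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)

    # One ray per output pixel, centred on the optical axis (+z).
    xs = np.arange(out_w) - (out_w - 1) / 2.0
    ys = np.arange(out_h) - (out_h - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate the rays: pitch about the x axis first, then yaw about the y axis.
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    rot_x = np.array([[1, 0, 0],
                      [0, np.cos(p), -np.sin(p)],
                      [0, np.sin(p), np.cos(p)]])
    rot_y = np.array([[np.cos(t), 0, np.sin(t)],
                      [0, 1, 0],
                      [-np.sin(t), 0, np.cos(t)]])
    dirs = dirs @ (rot_y @ rot_x).T

    # Convert ray directions to longitude/latitude, then to panorama pixels.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])       # in [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))  # in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * pano_w
    v = (lat / np.pi + 0.5) * pano_h
    ui = np.floor(u).astype(int) % pano_w              # wrap horizontally
    vi = np.clip(np.floor(v).astype(int), 0, pano_h - 1)
    return pano[vi, ui]
```

Called as `perspective_from_equirect(pano, 60, 0, 0, (256, 256))`, this would return a 256 x 256 view looking at the panorama centre. If the panorama has correct equirectangular distortion, straight lines in the scene should appear straight in such a view, which is the visual check the caption describes.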