Panorama Generation From NFoV Image Done Right
Dian Zheng, Cheng Zhang, Xiao-Ming Wu, Cao Li, Chengfei Lv, Jian-Fang Hu, Wei-Shi Zheng
TL;DR
This work tackles the problem of generating accurate 360-degree panoramas from NFoV images by exposing the distortion-evaluation gap in existing metrics and introducing Distort-CLIP, a distortion-aware evaluation model. Building on this, it proposes PanoDecouple, a decoupled diffusion framework with DistortNet for distortion guidance and ContentNet for content completion, linked through a disturbance-aware distortion map and an all-block condition-registration mechanism. A distortion correction loss leveraging Distort-CLIP further enforces distortion fidelity, enabling zero-shot generalization and strong performance with only 3K training samples. The approach achieves state-of-the-art image quality and distortion accuracy on SUN360 and Laval Indoor benchmarks and extends to practical tasks like text editing and text-to-panorama generation, indicating broad applicability in VR and 3D scene synthesis.
Abstract
Generating 360-degree panoramas from narrow field of view (NFoV) images is a promising computer vision task for Virtual Reality (VR) applications. Existing methods mostly assess the generated panoramas with InceptionNet- or CLIP-based metrics, which mainly capture image quality and are not suitable for evaluating distortion. In this work, we first propose a distortion-specific CLIP, named Distort-CLIP, to accurately evaluate panorama distortion, and with it we discover the "visual cheating" phenomenon in previous works (i.e., a tendency to improve visual results by sacrificing distortion accuracy). This phenomenon arises because prior methods employ a single network to learn two distinct tasks, panorama distortion and content completion, at once, which leads the model to prioritize optimizing the latter. To address the phenomenon, we propose PanoDecouple, a decoupled diffusion model framework that splits panorama generation into distortion guidance and content completion, aiming to generate panoramas with both accurate distortion and visual appeal. Specifically, we design a DistortNet for distortion guidance, which imposes a panorama-specific distortion prior and uses a modified condition registration mechanism, and a ContentNet for content completion, which imposes perspective image information. Additionally, a distortion correction loss function based on Distort-CLIP is introduced to constrain the distortion explicitly. Extensive experiments validate that PanoDecouple surpasses existing methods in both distortion and visual metrics.
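To make the notion of panorama distortion concrete, the following minimal sketch computes which pixels of a panorama are covered by a single NFoV view. It is an illustration only and not the authors' code: it assumes the panorama is stored in equirectangular format and that the NFoV image comes from a forward-facing pinhole camera with a square field of view. The function name `nfov_mask` and its parameters are hypothetical.

```python
import numpy as np

def nfov_mask(pano_h, pano_w, fov_deg=90.0):
    """Boolean mask of equirectangular pixels seen by a forward-facing
    pinhole camera with a square field of view (illustrative assumption)."""
    # Pixel centers mapped to longitude in (-pi, pi) and latitude in (-pi/2, pi/2).
    lon = ((np.arange(pano_w) + 0.5) / pano_w - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(pano_h) + 0.5) / pano_h) * np.pi
    lon, lat = np.meshgrid(lon, lat)

    # Unit viewing rays: camera looks along +z, x points right, y points up.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # A ray hits the image plane inside the FoV when |x/z| and |y/z| are
    # at most tan(fov/2); written without division to avoid z == 0.
    t = np.tan(np.radians(fov_deg) / 2.0)
    return (z > 0) & (np.abs(x) <= t * z) & (np.abs(y) <= t * z)

mask = nfov_mask(256, 512, fov_deg=90.0)
center_rows = int(mask[:, 256].sum())  # rows covered at the view center
edge_rows = int(mask[:, 192].sum())    # rows covered near the left edge
print(center_rows, edge_rows)
```

Under these assumptions the covered region is widest in latitude at the view center and narrows toward the left and right edges, so a rectangular perspective image does not map to a rectangle in the panorama. Getting this kind of geometry right, rather than only producing plausible content, is what the abstract refers to as distortion accuracy.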
