CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model

Xiaoding Yuan, Shitao Tang, Kejie Li, Alan Yuille, Peng Wang

TL;DR

This work addresses 360-degree panorama outpainting from a single camera-free image when camera intrinsics are unknown. It introduces CamFreeDiff, a diffusion-based pipeline that jointly learns a $3$-DOF homography $(f,\phi,\psi)$ mapping the input view to a predefined canonical view, thereby establishing pixel-level correspondences for eight target views. A frozen Stable Diffusion encoder supports text-guided generation, while an MLP-based homography estimator enables end-to-end differentiable integration with a correspondence-aware attention mechanism across views. Experiments on Matterport3D and the out-of-domain Structured3D demonstrate strong robustness to camera-free inputs and superior generalization, with the new-view variant delivering the best quality.

Abstract

This paper introduces the Camera-free Diffusion (CamFreeDiff) model for 360-degree image outpainting from a single camera-free image and a text description. The method distinguishes itself from existing strategies, such as MVDiffusion, by eliminating the requirement for predefined camera poses. Instead, our model incorporates a mechanism for predicting homography directly within the multi-view diffusion framework. The core of our approach is to formulate camera estimation as predicting the homography transformation from the input view to a predefined canonical view. The homography provides point-level correspondences between the input image and the target panoramic images, so that consistency between them can be enforced by correspondence-aware attention in a fully differentiable manner. Qualitative and quantitative experimental results demonstrate our model's strong robustness and generalization ability for 360-degree image outpainting in the challenging context of camera-free inputs.

Paper Structure (18 sections, 4 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Camera-free image-to-panorama generation: our method generates panoramas with a reasonable layout and a canonical viewpoint when the camera parameters of the input image are unknown.
  • Figure 2: In the context of panorama generation, we define the camera parameters for a canonical perspective view as $\mathrm{FoV}=90^\circ$, $x$-axis rotation $\phi=0^\circ$ and $z$-axis rotation $\psi=0^\circ$. The $y$-axis rotation can be any $\theta$ from $0^\circ$ to $360^\circ$.
  • Figure 3: We formulate camera parameter estimation as estimating the homography matrix from the input view to a predefined canonical view of the scene. We define the canonical view as the perspective view with an absolute rotation angle of $0^\circ$. We use a 3-DoF parameterization of the homography matrix instead of the standard 8-DoF one (details in Sec. \ref{sec:estimate_homography}).
  • Figure 4: Different strategies to generate a panorama from a camera-free input. After estimating the homography matrix $H$ from the input view to a predefined canonical view, the alternatives are: (a) Rectify the input view by unwarping and replacing the original input image. (b) Rectify the input view by unwarping and replacing the input latent after image encoding. (c) Provide point-level correspondences to the multi-view generation model to enforce consistency between corresponding points.
  • Figure 5: Our panorama generation pipeline, based on a multi-view diffusion denoising model. With the predicted homography matrix from the input view to a predefined canonical view, point-wise information can be aggregated from the input view to all target canonical views through correspondence-aware attention (CAA). For clarity, this figure shows the cross-attention for only one group of corresponding points; the same process is applied to all groups of corresponding points.
  • ...and 3 more figures
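
The captions for Figures 2 and 3 describe a homography from the input view to a canonical view, parameterized by three quantities (field of view, $x$-axis rotation $\phi$, $z$-axis rotation $\psi$). The sketch below illustrates how such a 3-DoF homography could be assembled with the standard pinhole relation $H = K_c R K_{in}^{-1}$. It is not the paper's implementation: the function names, the square image size, the rotation composition order `rot_x @ rot_z`, and the sign conventions are all assumptions made for illustration.

```python
import numpy as np

def intrinsics(fov_deg, size):
    """Pinhole intrinsics for a square image `size` pixels wide (assumed layout)."""
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    c = 0.5 * size                                      # principal point at image center
    return np.array([[f, 0.0, c],
                     [0.0, f, c],
                     [0.0, 0.0, 1.0]])

def rot_x(deg):
    """Rotation about the x-axis (pitch, phi)."""
    a = np.radians(deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a), np.cos(a)]])

def rot_z(deg):
    """Rotation about the z-axis (roll, psi)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

def homography_to_canonical(fov_deg, phi_deg, psi_deg, size=512, canonical_fov_deg=90.0):
    """Hypothetical helper: map input-view pixels to canonical-view pixels.

    The canonical view has FoV = 90 degrees and phi = psi = 0, as in Figure 2.
    The composition order of the two rotations is an assumption.
    """
    K_in = intrinsics(fov_deg, size)
    K_c = intrinsics(canonical_fov_deg, size)
    R = rot_x(phi_deg) @ rot_z(psi_deg)
    return K_c @ R @ np.linalg.inv(K_in)

def warp_points(H, pts):
    """Apply homography H to an (N, 2) array of pixel coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coordinates
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # back to pixel coordinates

# Usage: an input view with a 60-degree FoV, 10 degrees of pitch and 5 degrees of roll.
H = homography_to_canonical(fov_deg=60.0, phi_deg=10.0, psi_deg=5.0)
corners = np.array([[0.0, 0.0], [512.0, 0.0], [512.0, 512.0], [0.0, 512.0]])
canonical_corners = warp_points(H, corners)
```

Under these assumptions, an input that already matches the canonical parameters yields the identity homography, and a pure roll or a pure change of field of view leaves the image center fixed. Point-level correspondences of the kind described in the caption of Figure 5 could then be obtained by warping a pixel grid in the same way.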