Curved Diffusion: A Generative Model With Optical Geometry Control

Andrey Voynov, Amir Hertz, Moab Arar, Shlomi Fruchter, Daniel Cohen-Or

TL;DR

This work tackles the oversight in diffusion-based image synthesis where camera geometry is neglected. It introduces Curved Diffusion, a framework that injects arbitrary curved rendering geometry into a text-to-image diffusion model via per-pixel coordinate conditioning and, for broader surfaces, metric tensor conditioning. A self-attention reweighting scheme based on local warp density is proposed to maintain fidelity in warped regions. The approach enables controllable generation of lenses, photospheres, and spherical textures with a single model, and is supported by quantitative distortion fidelity metrics and human evaluations that attest to geometry-aware improvements. Together, these contributions broaden the practical applicability of diffusion models to VR, immersive visuals, and geometry-consistent texture synthesis.

Abstract

State-of-the-art diffusion models can generate highly realistic images from various conditioning signals such as text, segmentation, and depth. However, the specific camera geometry used during image capture, and with it the influence of different optical systems on the final scene appearance, is frequently overlooked. This study introduces a framework that tightly integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. Our method is based on per-pixel coordinate conditioning, which enables control over the rendering geometry. Notably, we demonstrate the manipulation of curvature properties, achieving diverse visual effects such as fish-eye, panoramic views, and spherical texturing with a single diffusion model.
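
As a rough illustration of per-pixel coordinate conditioning, the sketch below builds a normalized coordinate grid, warps it, and concatenates it with a noisy image along the channel axis. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the `[0, 1]` coordinate range, and the toy radial distortion `radial_warp` with its parameter `k` are all hypothetical choices made for this example.

```python
import numpy as np

def normalized_grid(height, width):
    """Per-pixel (x, y) coordinates, shape (2, height, width).

    Assumption: the centered maximal square crop spans [0, 1]^2, so the
    longer image side extends below 0 and above 1.
    """
    side = min(height, width)
    # Pixel centers, shifted so the square crop starts at 0, scaled by its side.
    ys = (np.arange(height) + 0.5 - (height - side) / 2) / side
    xs = (np.arange(width) + 0.5 - (width - side) / 2) / side
    grid_x, grid_y = np.meshgrid(xs, ys)  # each of shape (height, width)
    return np.stack([grid_x, grid_y], axis=0)

def radial_warp(grid, k):
    """Toy radial distortion around (0.5, 0.5); a hypothetical lens stand-in."""
    centered = grid - 0.5
    r2 = np.sum(centered ** 2, axis=0, keepdims=True)
    return 0.5 + centered * (1.0 + k * r2)

def conditioned_input(noisy_image, warp_field):
    """Channel-wise concatenation: (C, H, W) and (2, H, W) -> (C + 2, H, W)."""
    return np.concatenate([noisy_image, warp_field], axis=0)

rng = np.random.default_rng(0)
noisy_image = rng.normal(size=(3, 64, 96))  # placeholder for a noised image
warp_field = radial_warp(normalized_grid(64, 96), k=0.5)
model_input = conditioned_input(noisy_image, warp_field)
```

In this sketch the denoiser would receive `model_input`, so every pixel carries the coordinate it samples from in addition to its color channels.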

Paper Structure (28 sections, 6 equations, 23 figures)

Figures (23)

  • Figure 1: Images generated with different lens warps. Top row: unwarped 3D-sphere stereo panoramas; bottom row: fisheye lens, concave lens, sphere texturing.
  • Figure 2: Applying a lens geometry transformation as a post-process after image generation produces a low-quality image in highly expanded regions. Moreover, the corner regions that lie beyond the default image canvas are left uncovered.
  • Figure 3: The training scheme of the method. A random distortion is applied to both the training image and the normalized coordinate grid. The warped image is then noised with additive noise $\varepsilon_t$ corresponding to the denoising step $t$. The denoising U-Net takes the noised image concatenated with the warping field and predicts the denoised image. All self-attention layer weights are reweighted according to the density of the original image pixels.
  • Figure 4: Normalized coordinates in the image. The unit square of the coordinate system is the centered maximal square crop.
  • Figure 5: Images generated without (top row) and with (bottom row) self-attention reweighting. For each sample we also show its unwarped version. Omitting the reweighting leads to disproportionate object parts (first and third examples), hallucinations in low-density regions (second example), and overall quality degradation.
  • ...and 18 more figures
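
The captions for Figures 3 and 5 mention reweighting self-attention by the density of original image pixels, but the exact formula is not given here. The sketch below shows one plausible reading, stated as an assumption: the density of a pixel is estimated as the absolute Jacobian determinant of the warping field (the source area that pixel covers), and each key's attention weight is scaled by its density before renormalization. The names `warp_density` and `reweighted_attention` are hypothetical.

```python
import numpy as np

def warp_density(warp_field):
    """Estimate |det J| of a (2, H, W) warping field with finite differences.

    Assumption: this local area factor is used as the per-pixel density.
    """
    dx_drow, dx_dcol = np.gradient(warp_field[0])
    dy_drow, dy_dcol = np.gradient(warp_field[1])
    return np.abs(dx_dcol * dy_drow - dx_drow * dy_dcol)

def reweighted_attention(logits, density):
    """Softmax over keys, with each key's weight scaled by its density.

    logits: (num_queries, num_keys); density: one value per key pixel.
    Adding log-density to the logits equals multiplying the softmax
    weights by density and renormalizing (an assumed reweighting rule).
    """
    log_w = np.log(density.reshape(-1) + 1e-12)
    shifted = logits + log_w[None, :]
    shifted = shifted - shifted.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

# Uniform scaling warp on a 4x4 grid: every pixel covers the same source area.
h, w = 4, 4
rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
field = np.stack([cols * 0.25, rows * 0.25]).astype(float)
density = warp_density(field)
attn = reweighted_attention(np.zeros((h * w, h * w)), density)
```

With a uniform density the result reduces to an ordinary softmax; with a non-uniform warp, keys that cover more of the source image receive proportionally more attention under this assumed rule.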