CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation

Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, Federico Tombari

TL;DR

CubeDiff repurposes pretrained diffusion models to generate 360° panoramas by operating on cubemaps, treating each of the six faces as a perspective image and enabling cross-face coherence through inflated attention. Key innovations include synchronized GroupNorm across faces, cube-geometry positional encodings, overlapping face predictions, and classifier-free guidance, all trained on a diverse panorama corpus. Empirical results on Laval Indoor and SUN360 show state-of-the-art perceptual and text-alignment metrics, with strong generalization across text-only, image-only, and text+image conditioning, and an ability to perform fine-grained per-face text control. The approach achieves high-resolution, coherent panoramas with minimal architectural changes to existing diffusion models, offering practical impact for VR, gaming, and creative content generation.
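
The synchronized GroupNorm mentioned above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the `(faces, channels, height, width)` layout, and the omission of the learned scale and shift are assumptions made for illustration. The idea shown is that normalization statistics are computed once over all six faces rather than per face, so relative brightness differences between faces survive normalization.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Standard GroupNorm: statistics per face and per channel group.

    x has shape (F, C, H, W), where F is the number of cube faces.
    """
    f, c, h, w = x.shape
    g = x.reshape(f, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(f, c, h, w)

def synced_group_norm(x, num_groups, eps=1e-5):
    """Hypothetical synchronized variant: statistics shared across all faces."""
    f, c, h, w = x.shape
    g = x.reshape(f, num_groups, c // num_groups, h, w)
    # Reducing over axis 0 as well pools the statistics of all faces.
    mean = g.mean(axis=(0, 2, 3, 4), keepdims=True)
    var = g.var(axis=(0, 2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(f, c, h, w)

# Six faces with increasing brightness offsets (synthetic data).
rng = np.random.default_rng(0)
faces = rng.normal(size=(6, 8, 4, 4)) + np.arange(6).reshape(6, 1, 1, 1)
per_face = group_norm(faces, num_groups=2)
synced = synced_group_norm(faces, num_groups=2)
```

With per-face statistics every face is pulled to zero mean, which erases the brightness offsets; with shared statistics the brighter faces stay brighter after normalization.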

Abstract

We introduce a novel method for generating 360° panoramas from text prompts or images. Our approach leverages recent advances in 3D generation by employing multi-view diffusion models to jointly synthesize the six faces of a cubemap. Unlike previous methods that rely on processing equirectangular projections or autoregressive generation, our method treats each face as a standard perspective image, simplifying the generation process and enabling the use of existing multi-view diffusion models. We demonstrate that these models can be adapted to produce high-quality cubemaps without requiring correspondence-aware attention layers. Our model allows for fine-grained text control, generates high-resolution panorama images, and generalizes well beyond its training set, whilst achieving state-of-the-art results, both qualitatively and quantitatively. Project page: https://cubediff.github.io/
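
One way to read "jointly synthesize the six faces" is through attention inflation: the tokens of all faces are concatenated so that ordinary self-attention spans the whole cube. The sketch below is a toy under stated assumptions, not the paper's code. It uses a single head with identity query/key/value projections instead of learned ones, and the function names and the `(batch * faces, tokens, dim)` layout are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity projections; tokens: (B, N, D)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ tokens

def inflated_self_attention(face_tokens, num_faces=6):
    """Attend across all faces of each sample.

    face_tokens: (B * F, N, D), faces of one sample stored contiguously.
    """
    bf, n, d = face_tokens.shape
    b = bf // num_faces
    # Merge the face axis into the token axis: (B, F * N, D).
    joint = face_tokens.reshape(b, num_faces * n, d)
    out = self_attention(joint)
    # Split back into per-face token sets.
    return out.reshape(bf, n, d)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 8))   # one sample, six faces, 4 tokens of dim 8
x_edit = x.copy()
x_edit[1] += 1.0                 # perturb only face 1

# Per-face attention: face 0 is unaffected by the change to face 1.
per_face_same = np.allclose(self_attention(x)[0], self_attention(x_edit)[0])
# Inflated attention: face 0 now depends on face 1.
joint_same = np.allclose(inflated_self_attention(x)[0],
                         inflated_self_attention(x_edit)[0])
```

The only change relative to per-image attention is the reshape, which is why pretrained attention weights can be reused unchanged in such a scheme.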

Paper Structure

This paper contains 54 sections, 1 equation, 24 figures, 2 tables.

Figures (24)

  • Figure 1: CubeDiff leverages cubemaps to represent 360° panoramas and denoises all faces together in a single pass. In contrast to other works, CubeDiff does not need to consider distortions, since it operates on common 90° FOV perspective images, making it possible to directly utilize the internet-scale image prior of the underlying diffusion model.
  • Figure 2: An overview of our training pipeline and panorama model. (a) We project all training panoramas onto a cubemap and feed the faces to our frozen VAE encoder with synchronized GroupNorm to obtain the respective latents and enrich them with panorama-specific positional encodings for explicit spatial awareness. (b) We only train the inflated attention layers to be cross-frame aware.
  • Figure 3: Cubemaps and panoramas generated by CubeDiff with image and text conditions. We depict a diverse set of generated panoramas including indoor, outdoor, bright, and dark scenes. In all settings, CubeDiff produces high-quality, realistic panoramas that align with the input image.
  • Figure 4: Qualitative comparison between CubeDiff and baselines on the Laval Indoor dataset. Except for Text2Light, all panoramas are generated using the center face as input condition, with additional text prompts if applicable. For each sample we show the panorama image as well as two projected images. Please zoom in to compare the different methods.
  • Figure 5: Fine-grained text control. We show an example of fine-grained text control of the back face. Our model is able to change details following the provided prompt. First, we add a golden globe above the fireplace; second, we place a picture above the fireplace; third, we leave the space above empty; last, we instead add a bookshelf above it.
  • ...and 19 more figures
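
The captions of Figures 1 and 2 refer to projecting panoramas onto a cubemap of 90° FOV perspective faces. A minimal NumPy sketch of that projection follows, assuming an equirectangular input, nearest-neighbor sampling, and a face and axis convention (y up, z forward) chosen here for illustration. The `FACES` table and the function name are hypothetical and not taken from the paper.

```python
import numpy as np

# Assumed convention: (forward, right, up) unit vectors per face.
FACES = {
    "front":  ((0, 0, 1),  (1, 0, 0),  (0, 1, 0)),
    "right":  ((1, 0, 0),  (0, 0, -1), (0, 1, 0)),
    "back":   ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "left":   ((-1, 0, 0), (0, 0, 1),  (0, 1, 0)),
    "top":    ((0, 1, 0),  (1, 0, 0),  (0, 0, -1)),
    "bottom": ((0, -1, 0), (1, 0, 0),  (0, 0, 1)),
}

def equirect_to_face(pano, face, size):
    """Sample one size x size cube face from an equirectangular panorama."""
    h, w = pano.shape[:2]
    fwd, right, up = (np.array(v, dtype=float) for v in FACES[face])
    # Pixel centers on an image plane at distance 1; the plane spans [-1, 1]
    # because tan(45 deg) = 1, which gives a 90 deg field of view.
    t = (np.arange(size) + 0.5) / size * 2.0 - 1.0
    u, v = np.meshgrid(t, -t)  # negate so that row 0 is the top of the face
    d = fwd + u[..., None] * right + v[..., None] * up
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Ray direction -> longitude/latitude -> panorama pixel (nearest neighbor).
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))
    col = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
    row = np.clip(((0.5 - lat / np.pi) * h).astype(int), 0, h - 1)
    return pano[row, col]

# Synthetic panoramas whose pixel values encode the column or row index.
cols = np.tile(np.arange(128), (64, 1))
rows = np.tile(np.arange(64).reshape(-1, 1), (1, 128))
front = equirect_to_face(cols, "front", 8)
right_face = equirect_to_face(cols, "right", 8)
top = equirect_to_face(rows, "top", 8)
```

In this convention the front face samples around the panorama's center column, the right face a quarter turn further along, and the top face the rows nearest the upper edge. A production pipeline would more likely use bilinear interpolation than nearest-neighbor lookup.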