Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation

Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang

TL;DR

AOG-Net tackles open-vocabulary 360-degree image generation by autoregressively outpainting incomplete panoramas conditioned on NFoV inputs and text prompts. It introduces a global-local omni-aware conditioning framework that fuses text, omni-visual cues, NFoV signals, and omni-geometry through cross-attention and diffusion priors. Experiments on Laval indoor and outdoor HDR datasets show state-of-the-art performance with strong semantic alignment to prompts and robust generalization to unseen content, even with limited training data. The approach enables flexible, prompt-driven 360-degree content creation for VR/AR applications, with potential extensions to video and faster inference in the future.

Abstract

A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, to provide immersive experiences in scenarios such as virtual reality. Yet existing methods typically fall short in synthesising intricate visual details or in ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by outpainting an incomplete 360-degree image progressively with NFoV and text guidance, jointly or individually. This autoregressive scheme not only allows finer-grained and text-consistent patterns to be derived by dynamically generating and adjusting the process, but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidance, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream, which are fed into a conditioned generative backbone model. As AOG-Net can leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidance. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.
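The progressive outpainting idea in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not AOG-Net itself: the names `toy_generator` and `autoregressive_outpaint`, the window and stride sizes, and the mean-colour fill are all hypothetical placeholders for the conditioned generative backbone. The sketch only shows how a window can slide around an equirectangular canvas, wrapping in longitude, so that each step is conditioned on pixels that are already known.

```python
import numpy as np

def toy_generator(window, known, prompt):
    """Hypothetical stand-in for a conditioned generative backbone.

    Fills unknown pixels with the mean colour of the known pixels in the
    window. The prompt is accepted but ignored by this toy version.
    """
    out = window.copy()
    if known.any():
        fill = window[known].mean(axis=0)
    else:
        fill = np.zeros(window.shape[-1])
    out[~known] = fill
    return out

def autoregressive_outpaint(canvas, known, prompt, win=64, stride=32):
    """Fill an incomplete equirectangular canvas window by window.

    canvas: (H, W, C) float array, known: (H, W) boolean mask.
    Assumes stride <= win <= W so consecutive windows overlap.
    """
    canvas, known = canvas.copy(), known.copy()
    width = canvas.shape[1]
    # Start at the first column that already contains known pixels.
    start = int(np.argmax(known.any(axis=0)))
    steps = 0
    for offset in range(0, width, stride):
        # Column indices wrap around because longitude is periodic.
        cols = (start + offset + np.arange(win)) % width
        k = known[:, cols]
        if k.all():
            continue  # nothing left to generate in this window
        canvas[:, cols] = toy_generator(canvas[:, cols], k, prompt)
        known[:, cols] = True
        steps += 1
    return canvas, known, steps

# Usage: a small grey patch in the middle of an otherwise empty panorama.
canvas = np.zeros((32, 128, 3))
known = np.zeros((32, 128), dtype=bool)
canvas[8:24, 48:80] = 0.5
known[8:24, 48:80] = True
result, filled, steps = autoregressive_outpaint(canvas, known, "a living room", win=32, stride=16)
```

Because each window overlaps the previous one, every step after the first sees already generated content. A per-step prompt could be passed in place of the single `prompt` string to mimic editing the conditions during generation.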

Paper Structure

This paper contains 18 sections, 5 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Examples of 360-degree image generation, showcasing the limitations of existing methods compared to ours. The part above the dashed line depicts an NFoV-guided example and the part below the dashed line a text-guided example. (a) Input condition. (b) Ours (AOG-Net). (c) Top: OmniDreamer [akimoto2022diverse]. Bottom: Text2Light [chen2023text2light].
  • Figure 2: Illustration of the proposed AOG-Net architecture.
  • Figure 3: Different representations of a 360-degree image. (a) Equirectangular projection. (b) Spherical representation. (c) Cubemap projection. (d) A spherical representation with geometry coordinates. (e) Geometry projection on cubemap.
  • Figure 4: Comparison with Text2Light for indoor and outdoor settings. (a) Ours. (b) Text2Light.
  • Figure 5: Qualitative examples with input and ground truth. (a) Input, $90^{\circ}$ in both the longitude and latitude directions. (b) Ground truth. (c) OmniDreamer. (d) AOG-Net (Ours).
  • ...and 2 more figures
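
The representations named in the Figure 3 caption (equirectangular, spherical, cubemap) and the $90^{\circ}$ input region in the Figure 5 caption are related by standard geometry, which the sketch below illustrates. The axis convention (y up, +z at longitude 0), the function names, and the image sizes are assumptions made for illustration and are not taken from the paper. The mask is a rectangle in longitude and latitude, matching the caption's wording, rather than a perspective-camera footprint.

```python
import numpy as np

def pixel_to_lonlat(u, v, width, height):
    """Map equirectangular pixel centres to longitude/latitude in radians."""
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi   # -pi (left) to +pi (right)
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi  # +pi/2 (top) to -pi/2 (bottom)
    return lon, lat

def lonlat_to_xyz(lon, lat):
    """Unit vectors on the sphere; assumed convention: y up, +z at lon = 0."""
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def cube_face(xyz):
    """Name the cubemap face hit by each direction (dominant axis and sign)."""
    xyz = np.asarray(xyz)
    axis = np.argmax(np.abs(xyz), axis=-1)
    picked = np.take_along_axis(xyz, np.expand_dims(axis, -1), axis=-1)[..., 0]
    names = np.array([["-x", "+x"], ["-y", "+y"], ["-z", "+z"]])
    return names[axis, (picked >= 0).astype(int)]

def center_mask(width, height, fov_deg=90.0):
    """Boolean mask of a centred region spanning fov_deg in lon and lat."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    lon, lat = pixel_to_lonlat(u, v, width, height)
    half = np.deg2rad(fov_deg) / 2.0
    return (np.abs(lon) <= half) & (np.abs(lat) <= half)

# Usage: four directions around the equator and a 90-degree input mask.
lons = np.array([0.0, np.pi / 2, np.pi, -np.pi / 2])
faces = cube_face(lonlat_to_xyz(lons, np.zeros(4)))
mask = center_mask(64, 32)
```

Under these assumptions a $90^{\circ} \times 90^{\circ}$ region covers a quarter of the image width and half of its height, so only one eighth of the equirectangular pixels are given as input and the rest must be generated.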