Conditional Panoramic Image Generation via Masked Autoregressive Modeling

Chaoyang Wang, Xiangtai Li, Lu Qi, Xiaofan Lin, Jinbin Bai, Qianyu Zhou, Yunhai Tong

TL;DR

This work tackles the limitations of diffusion-based panorama generation by introducing PAR, a masked autoregressive framework that unifies text-to-panorama and panorama outpainting under ERP geometry. PAR circumvents the i.i.d. noise constraint of diffusion models and integrates text and image conditioning within a single likelihood objective, aided by dual-space circular padding and a translation consistency loss to enforce ERP-aware coherence. Empirical results on Matterport3D show performance competitive with or superior to specialist baselines, with notable gains in FID, FAED, DS, and outpainting quality, alongside strong zero-shot generalization. The approach offers a scalable, unified solution for panoramic content generation, with practical relevance to VR/AR, robotics, and visual navigation.

Abstract

Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
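
As a rough illustration of the circular padding mentioned in the abstract: an ERP panorama wraps around in longitude, so its left and right edges are neighbors on the sphere. The sketch below pads the width axis by copying columns from the opposite edge. The function name `circular_pad_width`, the `(..., H, W)` array layout, and the use of NumPy are assumptions made for illustration; this is not the authors' implementation, and it does not cover how PAR applies padding in both of its spaces.

```python
import numpy as np


def circular_pad_width(x: np.ndarray, pad: int) -> np.ndarray:
    """Wrap-pad the last (width / longitude) axis of an array shaped (..., H, W).

    Hypothetical helper: columns from the right edge are copied to the left
    side and vice versa, so the padded image is continuous across the seam.
    """
    if pad < 0 or pad > x.shape[-1]:
        raise ValueError("pad must be between 0 and the width of x")
    if pad == 0:
        return x
    left = x[..., -pad:]   # rightmost columns wrap around to the left side
    right = x[..., :pad]   # leftmost columns wrap around to the right side
    return np.concatenate([left, x, right], axis=-1)


# Toy 2x4 "panorama" whose values make the wrap-around easy to see.
erp = np.arange(8).reshape(2, 4)
padded = circular_pad_width(erp, 1)
print(padded)
```

For this simple case the result matches `np.pad(erp, ((0, 0), (1, 1)), mode="wrap")`. A convolution applied to the padded array sees consistent context on both sides of the seam.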

Paper Structure

This paper contains 20 sections, 19 equations, 15 figures, 11 tables.

Figures (15)

  • Figure 1: Generated samples from Panoramic AutoRegressive (PAR) Model. PAR unifies several conditional panoramic image generation tasks, including text-to-panorama, panorama outpainting, and panoramic image editing.
  • Figure 2: Method illustration. (a) PAR utilizes a transformer $f$ to predict regions obscured by mask $M$, then employs these predictions as conditions to drive an MLP $\epsilon_\theta$ in generating continuous tokens. The VAE encoder $\mathcal{E}$ and decoder $\mathcal{D}$ are frozen. The dashed line indicates the inference phase. $c$ and $t$ denote textual embeddings and time-steps, respectively. (b) Both original and $\mathcal{T}_v$-augmented triples $(x, \epsilon, M)$ are processed by the same model, and then aligned through a consistency loss.
  • Figure 3: Visual comparisons with previous methods on text-to-panorama task. Previous methods neglect the circular consistency characteristics (yellow box), or suffer from repetitive objects (red box), artifact generation (blue box), and inconsistent combinations (green box). The highlighted portions in captions indicate text-image alignment failure cases of baseline methods.
  • Figure 4: Qualitative comparisons of panorama outpainting on the Matterport3D dataset. PAR-1.4B is used for this task, where PAR w/o prompt means the textual prompt is set as empty.
  • Figure 5: Scaling parameters and training compute improve fidelity and soundness. Two cases are shown with 3 model sizes and 3 different training stages. From top to bottom: 0.3B, 0.6B, 1.4B. From left to right: $25\%$, $50\%$, $100\%$ of the training process.
  • ...and 10 more figures
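
The consistency alignment in Figure 2(b) can be sketched under stated assumptions. The caption does not define $\mathcal{T}_v$ or the exact loss, so the code below assumes that $\mathcal{T}_v$ is a circular shift by $v$ columns along the panorama width and that the loss is a mean squared error between the shifted output for the original triple and the output for the shifted triple. The names `translate`, `consistency_loss`, `equivariant_model`, and `seam_model` are hypothetical, and the two toy models stand in for the real network.

```python
import numpy as np


def translate(a: np.ndarray, v: int) -> np.ndarray:
    """Assumed form of T_v: circular shift by v columns along the width axis."""
    return np.roll(a, shift=v, axis=-1)


def consistency_loss(model, x, eps, mask, v: int) -> float:
    """MSE between T_v(model(x, eps, M)) and model(T_v x, T_v eps, T_v M)."""
    out = model(x, eps, mask)
    out_aug = model(translate(x, v), translate(eps, v), translate(mask, v))
    return float(np.mean((translate(out, v) - out_aug) ** 2))


def equivariant_model(x, eps, mask):
    """Toy model: a circular 3-tap average, which commutes with circular shifts."""
    z = np.where(mask, eps, x)  # masked positions are replaced by noise
    return (np.roll(z, 1, axis=-1) + z + np.roll(z, -1, axis=-1)) / 3.0


def seam_model(x, eps, mask):
    """Toy model with a position-dependent bias, so it treats columns unequally."""
    z = np.where(mask, eps, x)
    ramp = np.linspace(0.0, 1.0, z.shape[-1])
    return z + ramp


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
eps = rng.normal(size=(4, 8))
mask = rng.random((4, 8)) < 0.5

loss_eq = consistency_loss(equivariant_model, x, eps, mask, v=3)
loss_seam = consistency_loss(seam_model, x, eps, mask, v=3)
print(loss_eq, loss_seam)
```

In this toy setting, the shift-equivariant model incurs essentially zero loss, while the model with a column-dependent bias is penalized. That is the behavior a translation consistency term is meant to encourage.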