L-MAGIC: Language Model Assisted Generation of Images with Coherence

Zhipeng Cai; Matthias Mueller; Reiner Birkl; Diana Wofk; Shao-Yen Tseng; JunDa Cheng; Gabriela Ben-Melech Stan; Vasudev Lal; Michael Paulitsch

TL;DR

L-MAGIC tackles the challenge of generating coherent 360° panoramas from a single image by using large language models to guide diffusion-based multi-view inpainting, producing consistent layouts across views without network fine-tuning. The method iteratively warps and inpaints perspective views. BLIP-2 and ChatGPT supply view-specific descriptions and a global scene layout, so prompts are generated automatically and repeated objects are avoided. Super-resolution and careful multi-view fusion then raise output quality, yielding high-resolution panoramas; applying depth estimation additionally enables 3D point clouds and immersive video fly-throughs. Combined with conditional diffusion models, the method also supports multiple input modalities. Empirically, L-MAGIC outperforms state-of-the-art baselines on image-to-panorama and text-to-panorama tasks, with strong human preference (>70%), and demonstrates applications ranging from depth-based 3D reconstruction to anything-to-panorama generation. This work advances practical panoramic content creation for architecture, VR, and media by providing a zero-shot pipeline that leverages existing diffusion and language models.
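The control flow of the iterative warp-and-inpaint loop described above might be sketched as follows. This is only an illustrative outline under stated assumptions, not the authors' implementation: the function name `generate_views`, the four callables (`caption_fn`, `layout_fn`, `warp_fn`, `inpaint_fn`), the `num_views` parameter, and the evenly spaced yaw angles are all hypothetical placeholders introduced for this example.

```python
def generate_views(image, num_views, caption_fn, layout_fn, warp_fn, inpaint_fn):
    """Hypothetical sketch of language-guided iterative warping and inpainting.

    All callables are placeholders supplied by the caller:
      caption_fn(image) -> str               e.g. an image captioner such as BLIP-2
      layout_fn(caption, n) -> list[str]     e.g. an LLM returning one prompt per view
      warp_fn(views, yaw) -> (partial, mask) warps known views to a new camera yaw
      inpaint_fn(partial, mask, prompt)      e.g. a pre-trained diffusion inpainter
    """
    caption = caption_fn(image)
    # One description per view, so the global layout is fixed before inpainting.
    prompts = layout_fn(caption, num_views)

    step = 360.0 / num_views  # assumption: views are evenly spaced in yaw
    views = [image]           # view 0 is the input image itself
    for i in range(1, num_views):
        yaw = i * step
        # Warp what is already known into the new view; the mask marks missing pixels.
        partial, mask = warp_fn(views, yaw)
        # Fill the missing region, guided by the view-specific description.
        views.append(inpaint_fn(partial, mask, prompts[i]))
    return views
```

Because the models are passed in as callables, the loop itself needs no fine-tuning step, which mirrors the zero-shot use of pre-trained models described in the summary.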

Abstract

In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/IntelLabs/MMPano. The video presentation is available at https://youtu.be/XDMNEzH4-Ec?list=PLG9Zyvu7iBa0-a7ccNLO8LjcVRAoMn57s.

Paper Structure (21 sections, 2 equations, 16 figures, 1 algorithm)

Introduction
Related Work
Methodology
Warping
Inpainting with Language Model Assistance
Quality and Resolution Enhancement
Discussion
Experiments
Experimental Setup
Main Results
Analysis
Applications
Conclusion
L-MAGIC Prompts
Blur During Warping
...and 6 more sections

Figures (16)

Figure 1: Teaser. L-MAGIC is a novel method to generate a $360^\circ$ panoramic scene from a single input image. L-MAGIC utilizes large language models to control perspective diffusion models to generate multiple views with coherent $360^\circ$ layout. L-MAGIC is also compatible with images synthesized by conditional generative models, making it capable of creating panoramic scenes from various input modalities. A set of perspective images rather than a single panoramic image also allows our method to leverage off-the-shelf monocular depth estimation models to enable immersive experiences, e.g., scene fly-through or 3D point cloud generation.
Figure 2: L-MAGIC pipeline. The input is an image ${\mathcal{I}}$ either captured in the real-world or synthesized, e.g., by conditional diffusion models. Multiple novel views to compose a $360^\circ$ panoramic scene are generated by iterative warping and inpainting. Pre-trained diffusion models assisted by pre-trained language models are used to generate views with both high-quality local textures and coherent $360^\circ$ layouts. Further quality enhancement techniques ensure smooth blending of multiple views into high-resolution panoramic scenes. L-MAGIC can generate panorama images, immersive videos, and 3D point clouds from various types of inputs, such as images, text, and sketch drawings.
Figure 3: Quantitative results for image-to-panorama and text-to-panorama. (a) Human evaluations. Each baseline has two bars, one for the quality of rendered perspective views and one for the $360^\circ$ layout. The height of each bar is the fraction of votes in which our method is preferred; a value above $50\%$ (dashed line) means our method is preferred over the corresponding baseline. (b) Algorithmic evaluation using the Inception Score (IS). L-MAGIC consistently outperforms previous methods on both metrics.
Figure 4: Image-to-panorama visualizations. Stable Diffusion v2 cannot close the $360^\circ$ loop (sharp boundaries at the middle). Text2room and MVDiffusion lack mechanisms to avoid duplicate objects. L-MAGIC outputs have high local view quality and coherent scene layouts.
Figure 5: Text-to-panorama visualizations. Text2light, Stable Diffusion v2 and LDM3D cannot close the $360^\circ$ loop (sharp boundaries at the middle). Text2room and MVDiffusion generate panoramas with duplicate objects. L-MAGIC effectively addresses these problems, resulting in high-quality panoramas with reasonable scene layouts.
...and 11 more figures
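The Inception Score used in Figure 3(b) has the standard definition $\mathrm{IS} = \exp\big(\mathbb{E}_x\,\mathrm{KL}(p(y \mid x)\,\|\,p(y))\big)$, where $p(y \mid x)$ is a classifier's class distribution for image $x$ and $p(y)$ is its average over all images. A minimal sketch of that formula is given below. It assumes the per-image class probabilities have already been computed (in practice by an Inception network, which is not included here); the function name `inception_score` and the `eps` smoothing constant are choices made for this example, and the paper's exact evaluation protocol may differ.

```python
import math

def inception_score(probs, eps=1e-12):
    """Inception Score from per-image class probabilities.

    probs: list of lists; probs[i][j] is p(y=j | x_i), each row summing to 1.
    eps:   small constant guarding against log(0) (an assumption of this sketch).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]

    mean_kl = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl = sum(
            pj * (math.log(pj + eps) - math.log(marginal[j] + eps))
            for j, pj in enumerate(p)
            if pj > 0
        )
        mean_kl += kl / n
    return math.exp(mean_kl)
```

A higher score means each image is classified confidently while the set as a whole covers many classes; identical predictions for every image give the minimum score of 1.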
