L-MAGIC: Language Model Assisted Generation of Images with Coherence
Zhipeng Cai, Matthias Mueller, Reiner Birkl, Diana Wofk, Shao-Yen Tseng, JunDa Cheng, Gabriela Ben-Melech Stan, Vasudev Lal, Michael Paulitsch
TL;DR
L-MAGIC generates coherent 360° panoramas from a single image by using large language models to guide diffusion-based multi-view inpainting, producing a consistent layout across views without fine-tuning any network. The method iteratively warps and inpaints perspective views. BLIP-2 and ChatGPT supply a global scene layout and view-specific descriptions, so prompts are generated automatically and objects are not repeated across views. Super-resolution and careful multi-view fusion raise output quality, yielding high-resolution panoramas. Adding depth estimation enables 3D point clouds and immersive video fly-throughs, and combining the pipeline with conditional diffusion models lets it accept multiple input modalities. Empirically, L-MAGIC outperforms state-of-the-art baselines on image-to-panorama and text-to-panorama tasks, with more than 70% preference in human evaluations, and supports applications ranging from depth-based 3D reconstruction to anything-to-panorama generation. The result is a zero-shot pipeline for panoramic content creation in architecture, VR, and media that builds on existing diffusion and language models.
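The iterative procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the evenly spaced yaw schedule, the overlap parameter, and the two callbacks (`describe_view`, standing in for the language-model prompt step, and `inpaint_view`, standing in for warping plus diffusion inpainting) are all assumptions made for illustration.

```python
import math
from typing import Any, Callable, Dict, List


def plan_view_yaws(fov_deg: float, overlap_deg: float) -> List[float]:
    """Return evenly spaced yaw angles (degrees) whose views cover 360 degrees.

    Assumption: neighbouring views share at least `overlap_deg` degrees so
    that each new view can be inpainted from already generated content.
    """
    if not 0 < overlap_deg < fov_deg:
        raise ValueError("overlap must be positive and smaller than the field of view")
    step = fov_deg - overlap_deg
    n_views = math.ceil(360 / step)
    return [i * 360.0 / n_views for i in range(n_views)]


def generate_panorama_views(
    input_view: Any,
    yaws: List[float],
    describe_view: Callable[[float], str],
    inpaint_view: Callable[[Dict[float, Any], float, str], Any],
) -> Dict[float, Any]:
    """Fill in one perspective view per yaw, starting from the input image.

    `describe_view` and `inpaint_view` are hypothetical placeholders for the
    language-model guidance and the warp-and-inpaint step, respectively.
    """
    views: Dict[float, Any] = {yaws[0]: input_view}
    for yaw in yaws[1:]:
        prompt = describe_view(yaw)  # view-specific text description
        # Pass a copy so the callback cannot mutate the views gathered so far.
        views[yaw] = inpaint_view(dict(views), yaw, prompt)
    return views
```

With a 90° field of view and a 30° overlap, `plan_view_yaws(90.0, 30.0)` yields six views spaced 60° apart. The actual view order and overlap used by L-MAGIC may differ.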
Abstract
In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360-degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/IntelLabs/MMPano. The video presentation is available at https://youtu.be/XDMNEzH4-Ec?list=PLG9Zyvu7iBa0-a7ccNLO8LjcVRAoMn57s.
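To make the depth-to-point-cloud step concrete, the sketch below lifts a panoramic depth map to 3D points in plain Python. It assumes an equirectangular panorama, depth measured as distance from the camera center, and a y-up, z-forward axis convention. None of these choices is stated in the abstract, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def equirect_pixel_to_point(
    u: float, v: float, width: int, height: int, depth: float
) -> Point:
    """Map pixel (u, v) of an equirectangular image to a 3D point.

    Assumed convention: the image centre looks along +z, +y points up,
    and `depth` is the distance from the camera centre.
    """
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi   # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi  # latitude in (-pi/2, pi/2)
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.sin(lat)
    z = depth * math.cos(lat) * math.cos(lon)
    return (x, y, z)


def depth_map_to_points(depth_map: List[List[float]]) -> List[Point]:
    """Convert a row-major panoramic depth map to a list of 3D points."""
    height = len(depth_map)
    width = len(depth_map[0])
    points: List[Point] = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth > 0:  # skip pixels without a valid depth estimate
                points.append(equirect_pixel_to_point(u, v, width, height, depth))
    return points
```

Each returned point lies at distance `depth` from the origin, which gives a quick sanity check on any depth map passed through `depth_map_to_points`.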
