Turning Text and Imagery into Captivating Visual Video
Mingming Wang, Elijah Miller
TL;DR
The paper addresses the challenge of efficiently visualizing architectural designs from multiple perspectives. It introduces a diffusion-based framework, akin to Stable Video Diffusion, that produces coherent multi-view videos from a single image and generates design videos from textual descriptions, using a temporal UNet backbone and camera-motion control. Key contributions include a viewpoint-aware preprocessing stage, a multi-step diffusion process for cross-view consistency, and fine-tuning on architectural data to support rapid prototyping and immersive virtual presentations. The work demonstrates potential to accelerate design visualization, improve stakeholder communication, and integrate AI-assisted visualization into design workflows, while noting ethical considerations and the need to generalize across diverse architectural styles.
Abstract
In architectural design, the ability to visualize a structure from multiple perspectives is crucial for comprehensive planning and presentation. This paper applies generative video models, akin to Stable Video Diffusion, to architectural visualization. We explore the potential of these models to create consistent multi-perspective videos of buildings from single images and to generate design videos directly from textual descriptions. The proposed method supports the design process through rapid prototyping, cost and time savings, and a broader creative space for architects and designers. Our approach not only accelerates the visualization of architectural concepts but also enables a more interactive and immersive experience for clients and stakeholders. It allows deeper exploration of design possibilities and more effective communication of complex architectural ideas.
