Generative Powers of Ten
Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski
TL;DR
This work addresses semantic zoom: text-conditioned generation of content at multiple image scales that stays consistent across those scales. It introduces a joint multi-scale diffusion sampling framework built around a zoom stack representation and a multi-resolution blending step that uses Laplacian pyramids to fuse information across scales. Implemented on a cascaded diffusion model, and extended with a variant grounded in an input photograph, the method produces Powers of Ten–style zoom videos with coherent content and compares favorably, in qualitative evaluation, with autoregressive outpainting and progressive super-resolution baselines. The approach offers a flexible, text-driven tool for exploring multi-scale scenes, with potential applications in interactive visualization and cinematic generation where cross-scale coherence is essential.
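To make the Laplacian-pyramid blending idea concrete, here is a minimal NumPy sketch of the generic technique. It is not the authors' implementation: the box-filter `downsample`, the nearest-neighbor `upsample`, the function names, and the grayscale power-of-two image sizes are all assumptions chosen for brevity.

```python
import numpy as np

def downsample(img):
    # Assumption: 2x box-filter downsampling; image sides must be even.
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # Assumption: 2x nearest-neighbor upsampling.
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    # Each band holds the detail lost by one downsampling step;
    # the final entry is the low-resolution residual.
    bands = []
    current = img.astype(np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(current - upsample(smaller))
        current = smaller
    bands.append(current)
    return bands

def collapse(bands):
    # Invert the pyramid: start from the residual and add detail back.
    img = bands[-1]
    for band in reversed(bands[:-1]):
        img = upsample(img) + band
    return img

def blend(a, b, mask, levels):
    # Blend per frequency band, using a mask downsampled to each band's size.
    pa = laplacian_pyramid(a, levels)
    pb = laplacian_pyramid(b, levels)
    masks = [mask.astype(np.float64)]
    for _ in range(levels):
        masks.append(downsample(masks[-1]))
    blended = [m * la + (1 - m) * lb for m, la, lb in zip(masks, pa, pb)]
    return collapse(blended)
```

Blending band by band means low frequencies are mixed over a wide region while fine detail is mixed only near the mask boundary, which is the usual reason to prefer a pyramid over a single hard paste.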
Abstract
We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods, which may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
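One way to picture cross-scale consistency is to composite a stack of images, each showing the center of the previous one at a finer scale, into a single view. The sketch below is illustrative only: the zoom factor `p = 2`, the equal-sized grayscale layers, the box-filter `downsample`, and the name `render_level` are assumptions, not details taken from this summary.

```python
import numpy as np

def downsample(img, factor):
    # Assumption: box-filter downsampling by an integer factor.
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def render_level(stack, i, p=2):
    """Composite level i with every finer level pasted into its center.

    stack[j] is assumed to depict the central 1/p region of stack[j - 1],
    with all layers sharing the same pixel dimensions.
    """
    out = stack[i].astype(np.float64).copy()
    h, w = out.shape
    for j in range(i + 1, len(stack)):
        factor = p ** (j - i)
        if h % factor or w % factor:
            break  # finer levels would cover less than one whole pixel block
        small = downsample(stack[j].astype(np.float64), factor)
        sh, sw = small.shape
        top, left = (h - sh) // 2, (w - sw) // 2
        # Finer levels are pasted last, so they overwrite the center.
        out[top:top + sh, left:left + sw] = small
    return out
```

If the layers are consistent, the pasted regions agree with what they overwrite and the composite shows no visible seams; a hard paste like this one would expose any disagreement between scales.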
