Generative Powers of Ten

Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski

TL;DR

This work addresses semantic zoom: generating text-conditioned content across multiple image scales while keeping those scales consistent with one another. It introduces a joint multi-scale diffusion sampling framework built around a zoom stack representation and a multi-resolution blending step that uses Laplacian pyramids to fuse information across scales. Implemented on a cascaded diffusion model, with a variant grounded in an input photograph, the method produces Powers of Ten–style zoom videos with coherent content and, in qualitative comparisons, outperforms autoregressive outpainting and progressive super-resolution baselines. The approach offers a flexible, text-driven tool for exploring multi-scale scenes, with potential applications in interactive visualization and cinematic generation where cross-scale coherence is essential.

Abstract

We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.

Paper Structure (19 sections, 4 equations, 14 figures, 6 tables, 2 algorithms)

This paper contains 19 sections, 4 equations, 14 figures, 6 tables, 2 algorithms.

Figures (14)

  • Figure 1: Given a series of prompts describing a scene at varying zoom levels, e.g., from a distant galaxy to the surface of an alien planet, our method uses a pre-trained text-to-image diffusion model to generate a continuously zooming video sequence.
  • Figure 2: Powers of Ten (1977). This documentary film illustrates the relative scale of the universe in a single shot that gradually zooms out from a human to the universe, and then back in to the microscopic molecular level.
  • Figure 3: Zoom stack. Our representation consists of $N$ layer images $L_i$ of constant resolution (left). These layers are arranged in a pyramid-like structure, with layers representing finer details corresponding to a smaller spatial extent (middle). These layers are composited to form an image at any zoom level (right).
  • Figure 4: Overview of a single sampling step. (1) Noisy images $\mathbf{z}_{i, t}$ from each zoom level, along with the respective prompts $y_i$, are simultaneously fed into the same pretrained diffusion model, returning estimates of the corresponding clean images $\hat{\mathbf{x}}_{i, t}$. These images may have inconsistent estimates for the overlapping regions that they all observe. (2) We employ multi-resolution blending to fuse these regions into a consistent zoom stack $\mathcal{L}_{t}$ and re-render the different zoom levels from the consistent representation. (3) These re-rendered images $\Pi_{\text{image}}(\mathcal{L}_t; i)$ are then used as the clean image estimates in the DDPM sampling step.
  • Figure 5: Multi-resolution blending. We produce a consistent estimate for layer $L_i$ in the zoom stack by merging the $H/p_j\times W/p_j$ central region of the corresponding zoomed-out images $\mathbf{x}_{j}$ for $j \leq i$. This merging process involves (1) creating a Laplacian pyramid from each observation, (2) blending together the corresponding frequency bands to create a blended pyramid, and (3) recomposing this blended pyramid into an image, which is used to update the layer $L_i$.
  • ...and 9 more figures
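
The Laplacian-pyramid blending described in the Figure 5 caption can be illustrated with a minimal sketch. This is the classic two-image, mask-weighted form of pyramid blending, not the paper's exact procedure: the function names (`downsample`, `upsample`, `laplacian_pyramid`, `collapse`, `blend`), the 2x2 box filter standing in for a Gaussian kernel, the nearest-neighbour upsampling, and the mask-based weights are all assumptions made for illustration.

```python
import numpy as np

def downsample(img):
    """Halve resolution with a 2x2 box filter (assumed stand-in for blur + decimation)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return `levels` band-pass images followed by the low-pass residual.

    Height and width must be divisible by 2**levels.
    """
    bands = []
    current = img
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(current - upsample(smaller))  # detail lost by downsampling
        current = smaller
    bands.append(current)  # coarsest low-pass image
    return bands

def collapse(pyramid):
    """Invert laplacian_pyramid: upsample the residual and add each band back."""
    img = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        img = upsample(img) + band
    return img

def blend(a, b, mask, levels=3):
    """Blend a and b band by band; mask is 1 where `a` should dominate."""
    pa = laplacian_pyramid(a, levels)
    pb = laplacian_pyramid(b, levels)
    # Downsample the mask so each band has a weight map at matching resolution.
    masks = [mask]
    for _ in range(levels):
        masks.append(downsample(masks[-1]))
    blended = [m * x + (1 - m) * y for m, x, y in zip(masks, pa, pb)]
    return collapse(blended)

# Hypothetical usage: paste the centre of `a` into `b` with soft seams.
rng = np.random.default_rng(0)
a = rng.random((16, 16))
b = rng.random((16, 16))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0
out = blend(a, b, mask)
```

Because the mask is progressively downsampled, low-frequency bands are mixed over a wider transition region than high-frequency bands, which is what hides seams between the inputs. In the setting of the caption, the inputs would be the overlapping central regions of several zoomed-out observations rather than two arbitrary images.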