ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo

Abstract

While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
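
The idea of treating a wide canvas as a sequence of frames seen by a moving camera can be illustrated with a small sketch. This is not the authors' ScanPE implementation: the function names `scan_windows` and `global_coords`, the evenly spaced horizontal windows, and the toy sizes are all assumptions made for illustration. The sketch only shows how each frame could carry global canvas coordinates rather than frame-local ones.

```python
from typing import List, Tuple


def scan_windows(canvas_w: int, frame_w: int, num_frames: int) -> List[Tuple[int, int]]:
    """Return the global (x_start, x_end) column range covered by each frame.

    Hypothetical helper: frames are spread evenly so the first starts at
    column 0 and the last ends at column canvas_w. Neighbouring frames
    overlap whenever num_frames * frame_w > canvas_w.
    """
    if num_frames < 1 or frame_w < 1:
        raise ValueError("num_frames and frame_w must be positive")
    if frame_w > canvas_w:
        raise ValueError("frame_w must not exceed canvas_w")
    if num_frames * frame_w < canvas_w:
        raise ValueError("too few frames to cover the canvas")
    if num_frames == 1:
        return [(0, frame_w)]  # the checks above imply frame_w == canvas_w
    span = canvas_w - frame_w  # total distance the window travels
    starts = [(i * span) // (num_frames - 1) for i in range(num_frames)]
    return [(s, s + frame_w) for s in starts]


def global_coords(x_start: int, frame_w: int, frame_h: int) -> List[List[Tuple[int, int]]]:
    """Global (x, y) canvas index of every position in one frame, row by row."""
    return [[(x_start + x, y) for x in range(frame_w)] for y in range(frame_h)]


# Toy example: a 64-column canvas scanned by seven 16-column frames.
windows = scan_windows(canvas_w=64, frame_w=16, num_frames=7)
for t, (x0, x1) in enumerate(windows):
    print(f"frame {t}: global columns [{x0}, {x1})")
```

In this toy setting, consecutive frames share half of their columns, so the temporal axis of the frame sequence corresponds to a steady pan across the canvas.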

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Imagery of ScrollScape. ScrollScape reformulates high-resolution synthesis at extreme aspect ratios such as $8:1$ as a sequential video panning task. Leveraging robust video diffusion priors, it achieves exceptional $32\text{K}$ resolution across canvases ranging from traditional scrolls to photorealistic panoramas.
  • Figure 2: Overview of the ScrollScape framework. ScrollScape reformulates high-resolution EAR synthesis as a sequential video panning task. ScanPE re-engineers coordinate distributions by mapping global spatial indices $(x, y)$ onto a temporal sequence to ensure structural coherence across massive scales without the repetition typical of standard models. After generating low-resolution latents via a hierarchical DiT, the ScrollSR module utilizes video super-resolution priors to enhance details frame by frame. Finally, a 3D VAE decoder and frame fusion stage produce seamless, photorealistic panoramas and traditional scrolls at an exceptional $32\text{K}$ resolution.
  • Figure 3: Qualitative comparison on 8:1 horizontal panoramic scroll imagery. ScrollScape achieves rigorous structural coherence and expansive global diversity, whereas baselines suffer from the semantic repetition and boundary artifacts common in tiled synthesis.
  • Figure 4: Qualitative results on 8:1 vertical scroll paintings. ScrollScape maintains superior global structural coherence and expansive diversity, effectively eliminating the boundary artifacts and semantic repetition found in existing frameworks.
  • Figure 5: Demonstration of high-fidelity 8K imagery generated by ScrollScape. The results showcase the versatility of ScrollScape in producing imagery with structural richness and local clarity across various subjects, ranging from microscopic textures to macroscopic landscapes. Please zoom in for high-resolution details.
  • ...and 2 more figures
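
The frame fusion stage mentioned in the Figure 2 caption is not specified in this excerpt. As a minimal sketch of one generic way overlapping frames can be merged into a single canvas, the example below uses feathered (triangular-weight) averaging. This is a swapped-in standard technique, not necessarily what ScrollScape does, and the names `feather_weights` and `fuse_frames` are hypothetical.

```python
from typing import List, Sequence

Frame = List[List[float]]  # one frame as rows of pixel values


def feather_weights(width: int) -> List[float]:
    """Triangular weights: largest near the frame centre, smallest at its edges."""
    return [float(min(x + 1, width - x)) for x in range(width)]


def fuse_frames(frames: Sequence[Frame], starts: Sequence[int], canvas_w: int) -> Frame:
    """Blend horizontally overlapping frames into one canvas.

    Assumption for illustration: frames[i] covers canvas columns
    starts[i] .. starts[i] + frame_width - 1, and all frames share one height.
    """
    height = len(frames[0])
    acc = [[0.0] * canvas_w for _ in range(height)]  # weighted pixel sums
    norm = [0.0] * canvas_w                          # summed weights per column
    for frame, x0 in zip(frames, starts):
        weights = feather_weights(len(frame[0]))
        for x, wx in enumerate(weights):
            norm[x0 + x] += wx
            for y in range(height):
                acc[y][x0 + x] += wx * frame[y][x]
    if any(n == 0.0 for n in norm):
        raise ValueError("frames leave part of the canvas uncovered")
    return [[acc[y][x] / norm[x] for x in range(canvas_w)] for y in range(height)]


# Toy example: two 4-column crops of a 6-column gradient, overlapping by 2 columns.
left = [[0.0, 1.0, 2.0, 3.0]]
right = [[2.0, 3.0, 4.0, 5.0]]
canvas = fuse_frames([left, right], starts=[0, 2], canvas_w=6)
print(canvas)
```

Where the frames agree in their overlap, the weighted average returns the shared value; where they disagree, the feathering shifts weight toward the frame whose centre is closer, which softens visible seams.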