Learning Images Across Scales Using Adversarial Training

Krzysztof Wolski, Adarsh Djeacoumar, Alireza Javanmardi, Hans-Peter Seidel, Christian Theobalt, Guillaume Cordonnier, Karol Myszkowski, George Drettakis, Xingang Pan, Thomas Leimkühler

TL;DR

This work addresses learning a coherent, continuous scale-space representation from unstructured, low-resolution image patches, enabling exploration of content across orders of magnitude in scale. It introduces a multiscale generator based on an alias-free StyleGAN3 augmented with progressively distributed Fourier features, coupled with a scale-consistency loss and a progressive patch-sampling strategy to stabilize training across large scale spans. The methodology supports two modes: multiscale pseudo-reconstruction of a single underlying scale space and multiscale generation across environments, achieving up to 256× zoom with high scale coherence and competitive perceptual quality. The approach yields substantial data compression advantages and enables interactive rendering at around 20 frames per second, offering a new direction for efficient, scalable image representations and synthesis across wide scale ranges.

Abstract

The real world exhibits rich structure and detail across many scales of observation. It is difficult, however, to capture and represent a broad spectrum of scales using ordinary images. We devise a novel paradigm for learning a representation that captures an orders-of-magnitude variety of scales from an unstructured collection of ordinary images. We treat this collection as a distribution of scale-space slices to be learned using adversarial training, and additionally enforce coherency across slices. Our approach relies on a multiscale generator with carefully injected procedural frequency content, which allows interactive exploration of the emerging continuous scale space. Training across vastly different scales poses challenges regarding stability, which we tackle using a supervision scheme that involves careful sampling of scales. We show that our generator can be used as a multiscale generative model, and for reconstructions of scale spaces from unstructured patches. Significantly outperforming the state of the art, we demonstrate zoom-in factors of up to 256x at high quality and scale consistency.

Paper Structure (21 sections, 8 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: Different paradigms to obtain a multiscale image representation. Orange blocks indicate the location of the input data in scale space (trapezoid). a) Level-of-detail methods require a full image at the finest scale and construct the scale space using low-pass filtering. b) Super-resolution infers slightly finer scales from a coarse-scale image. c) Approaches relying on structured aggregation assume registered images. d) Our approach relies on an unstructured collection of low-resolution input images: The locations of the images in scale space are unknown (question marks in the orange blocks) and do not even necessarily depict the same scene. We nevertheless produce full coherent scale spaces.
  • Figure 2: Typical samples from a multiscale dataset. The images have a fairly low resolution ($256 \times 256$ for us) and are unstructured, i.e., we do not have information about relative 2D location, allowing uncomplicated capture or collection without the need for registration. Images courtesy of Bartosz Wojczyński [wojczynski2021].
  • Figure 3: (a) A scale space is a multiscale representation of an image. It is a continuous function of spatial coordinates $\mathbf{x} = (x, y)^T$ and bandwidth $\omega$. Increasing $\omega$ introduces higher and higher frequencies. (b) An $x$-$\omega$ slice through the volume in (a). The resolution of spatial discretizations (white grids) needs to be adapted to a given $\omega$ to capture all frequency content. Input to our method is an unstructured collection of 2D image patches that sample the scale space (green bars). Each patch has a continuous location and scale. All patches have the same resolution ($N_p = 8$ in this visualization), which leads to different coverage of the spatial domain depending on their scale. Our method generates orders-of-magnitude scale spaces from this unstructured information. Notice that it is difficult to depict the actual resolution levels in a figure this size. We consider scale spaces where the image at the top already requires a resolution of $256 \times 256$, leading to tens of thousands of pixels at the bottom.
  • Figure 4: Overview of our approach. Our multiscale generator $G$ takes a patch location $\mathbf{x}_p$ and scale $s_p$, as well as a random seed $\mathbf{z}$, as input and synthesizes a corresponding image. A discriminator $D$ compares the distributions of synthesized and data patches. Our generator architecture augments an alias-free StyleGAN with carefully designed Fourier features that are distributed across network layers, which allows synthesizing image patches from continuous orders-of-magnitude scale spaces. Dataset patches courtesy of [rembrandt].
  • Figure 5: (a) The StyleGAN3 generator rasterizes Fourier features $\mathbf{f}$ (one is shown) and feeds them through a synthesis network $S$ to obtain an output image. (b) A spatial offset of $\mathbf{f}$ results in a shifted image. (c) Scaling up $\mathbf{f}$ leads to flat feature maps, which $S$ cannot translate into a meaningful image. In the setups (a)-(c), the features $\mathbf{f}$ are not modulated (constant weighting $w$). (d) Progressively blending in different $\mathbf{f}$ using the weighting function $w(s_p)$ results in a permanent re-scaling of individual features $\mathbf{f}$ (three out of many are shown), leading to unstable results. (e) We create Fourier features in bins (two bins, pink and blue, are shown) and blend in all features per bin at the same time. This leads to a significant reduction of blending (here, only the pink bin needs blending). Additionally, we inject features into different layers of $S$, significantly enhancing coherency across scales.
  • ...and 9 more figures
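
The geometry described in the captions of Figures 3 and 5 can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that the scale space spans the unit square, that the scale $s_p$ acts as a zoom factor, and that a patch at scale $s_p$ covers a window of side $1/s_p$ centered at $\mathbf{x}_p$. The function names `patch_grid`, `fourier_feature`, and `equivalent_resolution` are hypothetical. The sketch shows why fixed-resolution patches cover less of the domain as the scale grows, and why a single fixed-frequency Fourier feature becomes nearly flat when rasterized over a strongly zoomed-in patch (Figure 5c).

```python
import numpy as np


def patch_grid(x_p, s_p, n_p=8):
    """Pixel-center coordinates of an n_p x n_p patch, shape (n_p, n_p, 2).

    Assumption (not from the paper): the domain is the unit square and a
    patch at zoom factor s_p covers a window of side 1 / s_p around x_p.
    """
    # Pixel centers in [-0.5, 0.5], scaled to the patch's spatial extent.
    axis = ((np.arange(n_p) + 0.5) / n_p - 0.5) / s_p
    gx, gy = np.meshgrid(x_p[0] + axis, x_p[1] + axis, indexing="xy")
    return np.stack([gx, gy], axis=-1)


def fourier_feature(coords, freq, phase=0.0):
    """Rasterize one sinusoidal feature sin(2*pi*(freq . x) + phase)."""
    return np.sin(2.0 * np.pi * (coords @ np.asarray(freq)) + phase)


def equivalent_resolution(n_p, zoom):
    """Side length (in pixels) of a full image at the finest scale."""
    return n_p * zoom


if __name__ == "__main__":
    # 256 x 256 patches with 256x zoom: tens of thousands of pixels per side.
    print("finest-scale side length:", equivalent_resolution(256, 256))

    freq = (2.0, 0.0)  # two cycles across the unit square (illustrative)
    for s_p in (1.0, 16.0, 256.0):
        feat = fourier_feature(patch_grid((0.5, 0.5), s_p), freq)
        # The value range shrinks as the patch zooms in: the map flattens.
        print(f"s_p={s_p:6.1f}  range={feat.max() - feat.min():.4f}")
```

In this toy setting, the value range of the rasterized feature collapses as $s_p$ increases, which is consistent with the caption's remark that scaled-up features yield flat feature maps. Features of higher frequency would have to be introduced at finer scales to keep the patch informative.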