$\infty$-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions

Minh-Quan Le, Alexandros Graikos, Srikar Yellapragada, Rajarsi Gupta, Joel Saltz, Dimitris Samaras

TL;DR

This paper tackles controllable, high-resolution image synthesis in domains requiring very large images, where traditional finite-dimensional diffusion models and patch-based methods struggle to preserve global structures or scale efficiently. It introduces $\infty$-Brush, a conditional diffusion model operating in function space with a cross-attention neural operator to condition in $\mathcal{H}$, enabling arbitrary resolutions up to $4096\times4096$ while training on only $0.4\%$ of pixels via a smoothing operator $\mathbf{A}$. Key contributions include the first conditional diffusion framework in infinite dimensions, the cross-attention neural operator for function-space conditioning, and a two-level denoiser (sparse and grid levels) that maintains global coherence and local detail under large-scale generation. Empirical results on histopathology and satellite imagery demonstrate strong global-structure fidelity (CLIP-FID) and competitive local detail (Crop-FID) with favorable computational efficiency compared to finite-dimensional baselines.

Abstract

Synthesizing high-resolution images from intricate, domain-specific information remains a significant challenge in generative modeling, particularly for applications in large-image domains such as digital histopathology and remote sensing. Existing methods face critical limitations: conditional diffusion models in pixel or latent space cannot exceed the resolution on which they were trained without losing fidelity, and computational demands increase significantly for larger image sizes. Patch-based methods offer computational efficiency but fail to capture long-range spatial relationships due to their overreliance on local information. In this paper, we introduce a novel conditional diffusion model in infinite dimensions, $\infty$-Brush, for controllable large image synthesis. We propose a cross-attention neural operator to enable conditioning in function space. Our model overcomes the constraints of traditional finite-dimensional diffusion models and patch-based methods, offering scalability and superior capability in preserving global image structures while maintaining fine details. To the best of our knowledge, $\infty$-Brush is the first conditional diffusion model in function space that can controllably synthesize images at arbitrary resolutions of up to $4096\times4096$ pixels. The code is available at https://github.com/cvlab-stonybrook/infinity-brush.
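
To make the idea of conditioning in function space more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which features at sampled coordinates (queries) attend to a set of conditioning tokens (keys and values). It is not the paper's cross-attention neural operator: the function names, the use of several conditioning tokens, and the omission of learned projections are all assumptions made for illustration. The point it shows is that the operation is applied independently at each query point, so the same computation works for any number of sampled coordinates.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: N feature vectors, one per sampled coordinate (length d each).
    keys, values: M conditioning tokens (hypothetical; no learned projections).
    Returns N output vectors, each a convex combination of the value vectors.
    """
    d = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity between this point's feature and every conditioning token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs

# Three query points, two conditioning tokens with identical keys:
# every point then weights both tokens equally.
point_features = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
cond_keys = [[1.0, 1.0], [1.0, 1.0]]
cond_values = [[0.0, 2.0], [4.0, 6.0]]
conditioned = cross_attention(point_features, cond_keys, cond_values)
```

Because the loop runs per query point, adding or removing sampled coordinates changes only the number of output rows, not the conditioning tokens or the computation itself.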

Paper Structure (23 sections, 8 theorems, 40 equations, 11 figures, 5 tables)

Key Result

Proposition (Learning Objective)

The cross-entropy of conditional diffusion models in function space has a variational upper bound of

Figures (11)

  • Figure 1: $\infty$-Brush is able to controllably generate images at arbitrary resolutions of up to $4096 \times 4096$, conditioned on any available auxiliary information about the images.
  • Figure 2: Given a noisy function $\mathbf{u} \in \mathcal{H}$, we discretize it by randomly selecting a subset of coordinates $\mathbf{x} = \{\mathbf{x}^{(i)}\}_{1 \le i \le N} \subset \mathcal{X}$, then feed it into our conditional denoiser, which returns a denoised function $\mathbf{s} \in \mathcal{H}$. The conditional denoiser architecture of $\infty$-Brush includes a sparse level and a grid level. The sparse level (in blue) utilizes a sparse neural operator, a cross-attention neural operator, and a self-attention neural operator, focusing on capturing fine-grained details. The grid level (in pink) targets global information. We use k-NN linear interpolation to transform the sparse points to a regularly spaced grid.
  • Figure 3: Large images ($1024 \times 1024$) generated from our $\infty$-Brush, conditioned on the facial attribute blonde/non-blonde hair.
  • Figure 4: Very large ($4096 \times 4096$) and large ($1024 \times 1024$) images generated from $\infty$-Brush, and the corresponding reference real images used to generate them. Given a single embedding vector of a downsampled $256\times256$ real image, $\infty$-Brush can synthesize images of up to $4096 \times 4096$ and preserve global structures of the reference image.
  • Figure 5: Comparison of long-range dependencies between our $\infty$-Brush and the patch-based method [graikos2023learned]. $\infty$-Brush retains large-scale structures (such as clearly separated clusters of cells) that can span multiple patches, unlike the image generated by [graikos2023learned].
  • ...and 6 more figures
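
The discretization and sparse-to-grid steps described in the Figure 2 caption can be sketched roughly as follows. This is a hedged illustration in plain Python, not the authors' implementation: `sample_coords` and `knn_interpolate` are hypothetical helper names, and "k-NN linear interpolation" is read here as an inverse-distance-weighted mean over the k nearest sampled points, which is an assumption.

```python
import math
import random

def sample_coords(height, width, n_points, seed=0):
    """Pick n_points distinct (row, col) pixel coordinates uniformly at random."""
    rng = random.Random(seed)
    flat = rng.sample(range(height * width), n_points)
    return [(idx // width, idx % width) for idx in flat]

def knn_interpolate(coords, values, grid_h, grid_w, src_h, src_w, k=3):
    """Resample sparse point values onto a regular grid_h x grid_w grid.

    Each grid cell takes an inverse-distance-weighted mean of its k nearest
    sparse points (an assumed reading of "k-NN linear interpolation").
    Brute-force search; fine for a sketch, too slow for large images.
    """
    grid = []
    for gi in range(grid_h):
        row = []
        for gj in range(grid_w):
            # Centre of the grid cell, expressed in source pixel coordinates.
            y = (gi + 0.5) * src_h / grid_h - 0.5
            x = (gj + 0.5) * src_w / grid_w - 0.5
            nearest = sorted(
                (math.hypot(y - cy, x - cx), v)
                for (cy, cx), v in zip(coords, values)
            )[:k]
            if nearest[0][0] == 0.0:
                row.append(nearest[0][1])  # cell centre coincides with a sample
                continue
            weights = [1.0 / d for d, _ in nearest]
            total = sum(weights)
            row.append(sum(w * v for w, (_, v) in zip(weights, nearest)) / total)
        grid.append(row)
    return grid

# Sample 64 of the 1024 pixels of a 32 x 32 image whose intensity is a
# vertical ramp, then resample the sparse values onto an 8 x 8 grid.
coords = sample_coords(32, 32, n_points=64)
values = [r / 31 for r, _ in coords]
grid = knn_interpolate(coords, values, 8, 8, 32, 32)
```

A practical version would replace the brute-force neighbour search with a spatial index, but the structure (random coordinate subset, then neighbour-weighted resampling to a regular grid) is the same.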

Theorems & Definitions (15)

  • Proposition: Learning Objective
  • Proof
  • Lemma: Measure Equivalence - The Feldman-Hájek Theorem
  • Lemma: The Radon-Nikodym Derivative
  • Proof
  • Theorem: Conditional Diffusion Optimality in Function Space
  • Proof
  • Proposition: Learning Objective
  • Proof
  • Lemma: Measure Equivalence - The Feldman-Hájek Theorem
  • ...and 5 more