Image Neural Field Diffusion Models

Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi

TL;DR

This work introduces Image Neural Field Diffusion (INFD), a two-stage framework that learns a diffusion model over latent image neural fields rather than fixed-resolution RGB images. In the first stage, an image neural field autoencoder (E, D, R) maps an image to a latent z, which is decoded and rendered into photorealistic, scale-consistent outputs by the Convolutional Local Image Function (CLIF) renderer. In the second stage, a latent diffusion model generates z, from which high-resolution images are rendered on demand. The method supports mixed-resolution training, patchwise supervision, and high-resolution synthesis (up to 2K) without a separate super-resolution (SR) model, and it solves multi-scale inverse problems efficiently using region-wise CLIP constraints. Compared with latent diffusion models followed by SR, INFD produces better high-resolution detail and avoids SR-domain artifacts. It performs strongly on the FFHQ and Mountains datasets and extends to text-to-image generation and to resolutions beyond 1K.

Abstract

Diffusion models have shown an impressive ability to model complex data distributions, with several key advantages over GANs, such as stable training, better coverage of the training distribution's modes, and the ability to solve inverse problems without extra training. However, most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models. To achieve this, a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method, inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models followed by super-resolution models, and can solve inverse problems with conditions applied at different scales efficiently.

Paper Structure (40 sections, 3 equations, 22 figures, 6 tables)

This paper contains 40 sections, 3 equations, 22 figures, and 6 tables.

Figures (22)

  • Figure 1: Generated samples from our image neural field diffusion models. We show photorealistic high-resolution image generation by rendering generated image neural fields at 2K resolution for single-domain models (left and middle), as well as general text-to-image models (right), with an efficient diffusion process on a latent representation at only 64 × 64 resolution.
  • Figure 2: Method overview. Given a training image at an arbitrary resolution, we first downsample it to a fixed resolution and pass it into the encoder $E$ to get a latent representation $z$. A decoder $D$ then takes $z$ as input and produces a feature map $\phi$ that drives a neural field renderer $R$, which renders images when queried with the appropriate grid of pixel coordinates $c$ and pixel sizes $s$. The autoencoder is trained on crops from a randomly downsampled ground-truth image, and renders image crops at the corresponding coordinates. At test time, a diffusion model generates a latent representation $z$, which is then decoded and used to render a high-resolution image.
  • Figure 3: Convolutional Local Image Function (CLIF). Given a feature map $\phi$ (yellow dots), for each query point $x$ (green dot), we fetch the nearest feature vector, along with the relative coordinates and the pixel size. The grid of query information is then passed into a convolutional network (right) that renders an RGB grid. Unlike the pointwise function LIIF, CLIF has a higher generation capacity while still being trained to be scale-consistent.
  • Figure 4: Generated samples from our method on the FFHQ and Mountains datasets.
  • Figure 5: Qualitative comparison with LDM followed by super-resolution on FFHQ. LIIF produces noisy details, while Real-ESRGAN tends to produce smooth results that lack rich detail. Our approach generates images with more realistic details.
  • ...and 17 more figures
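
The query construction described in the Figure 2 and Figure 3 captions can be illustrated with a small NumPy sketch. It is not the authors' implementation. It builds the grid of pixel coordinates and pixel sizes for a requested region and resolution, fetches the nearest feature vector for each query point, and attaches the relative coordinates and pixel size. The function names (`make_query_grid`, `build_clif_input`), the normalized [0, 1] image domain, and the scaling of offsets and sizes by the feature-map resolution are all assumptions made for illustration. The convolutional network that turns this grid into RGB values is omitted.

```python
import numpy as np

def make_query_grid(height, width, region=(0.0, 0.0, 1.0, 1.0)):
    """Pixel-center coordinates and pixel sizes for rendering a region.

    `region` is (y0, x0, y1, x1) in an assumed normalized [0, 1] image
    domain; the output has `height` x `width` pixels.
    """
    y0, x0, y1, x1 = region
    sy = (y1 - y0) / height  # pixel size along y
    sx = (x1 - x0) / width   # pixel size along x
    ys = y0 + (np.arange(height) + 0.5) * sy
    xs = x0 + (np.arange(width) + 0.5) * sx
    coords = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)  # (H, W, 2)
    sizes = np.broadcast_to(np.array([sy, sx]), coords.shape)       # (H, W, 2)
    return coords, sizes

def build_clif_input(phi, coords, sizes):
    """Assemble per-pixel query information from a feature map.

    phi: (h, w, C) feature map whose cell (i, j) is centered at
    ((i + 0.5) / h, (j + 0.5) / w). Returns an (H, W, C + 4) grid holding
    the nearest feature, the relative coordinate, and the pixel size.
    """
    h, w, _ = phi.shape
    # Index of the feature cell containing (hence nearest to) each query point.
    iy = np.clip(np.floor(coords[..., 0] * h).astype(int), 0, h - 1)
    ix = np.clip(np.floor(coords[..., 1] * w).astype(int), 0, w - 1)
    nearest = phi[iy, ix]                                            # (H, W, C)
    centers = np.stack([(iy + 0.5) / h, (ix + 0.5) / w], axis=-1)
    # Express offsets and pixel sizes in units of one feature cell (assumption).
    scale = np.array([h, w])
    rel_coord = (coords - centers) * scale
    rel_size = sizes * scale
    return np.concatenate([nearest, rel_coord, rel_size], axis=-1)

# Example: a 4 x 4 feature map with 8 channels queried at 16 x 16 pixels.
phi = np.random.default_rng(0).normal(size=(4, 4, 8))
coords, sizes = make_query_grid(16, 16)
query = build_clif_input(phi, coords, sizes)  # shape (16, 16, 12)
```

Because the coordinates and pixel sizes are defined on a continuous domain, a crop queried at a matching pixel size receives exactly the same inputs as the corresponding window of the full-resolution grid. This is what makes patchwise supervision on crops compatible with rendering the whole image at test time.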