Image Neural Field Diffusion Models
Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi
TL;DR
This work introduces Image Neural Field Diffusion (INFD), a two-stage framework that learns a diffusion model on latent image neural fields rather than on fixed-resolution RGB images. In the first stage, an image neural field autoencoder (encoder E, decoder D, renderer R) maps an image to a latent z, from which the Convolutional Local Image Function (CLIF) renderer produces photorealistic, scale-consistent outputs. In the second stage, a latent diffusion model generates z, and high-resolution images are rendered from it on demand. The method supports mixed-resolution training, patchwise supervision, and high-resolution synthesis (up to 2K) without a separate super-resolution (SR) model, and it solves multi-scale inverse problems efficiently using region-wise CLIP constraints. Compared with latent diffusion models followed by SR, INFD produces better high-resolution detail and avoids SR-domain artifacts. It performs strongly on the FFHQ and Mountains datasets and extends to text-to-image tasks and resolutions beyond 1K.
Abstract
Diffusion models have shown an impressive ability to model complex data distributions, with several key advantages over GANs, such as stable training, better coverage of the training distribution's modes, and the ability to solve inverse problems without extra training. However, most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models. To achieve this, a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method, inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models followed by super-resolution models, and can solve inverse problems with conditions applied at different scales efficiently.
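To make "rendered at any resolution" concrete, the following is a minimal NumPy sketch of a toy image neural field. It is not the paper's CLIF renderer: `ToyFieldRenderer`, `make_coords`, the random feature map standing in for a decoded latent, and the untrained random MLP weights are all hypothetical names and assumptions chosen for illustration. The sketch only shows that a field queried at pixel-center coordinates can be rendered at different resolutions, and that in this pointwise toy a patch rendered on its own matches the corresponding crop of the full render, which is the property patchwise supervision relies on.

```python
import numpy as np


def make_coords(height, width, top=0.0, left=0.0, bottom=1.0, right=1.0):
    """Pixel-center coordinates for a (height, width) grid over a region of [0, 1]^2."""
    ys = top + (np.arange(height) + 0.5) * (bottom - top) / height
    xs = left + (np.arange(width) + 0.5) * (right - left) / width
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx], axis=-1)  # shape (H, W, 2), ordered (y, x)


class ToyFieldRenderer:
    """Hypothetical pointwise renderer: nearest feature + relative offset + cell size -> RGB."""

    def __init__(self, channels, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = channels + 2 + 2  # feature vector, relative offset, cell size
        # Untrained random weights; a real renderer would be learned.
        self.w1 = rng.normal(0.0, 0.5, (in_dim, hidden))
        self.w2 = rng.normal(0.0, 0.5, (hidden, 3))

    def render(self, feat, coords, cell):
        """feat: (C, h, w) feature map; coords: (H, W, 2) in [0, 1]; cell: output pixel size (y, x)."""
        _, h, w = feat.shape
        # Index of the feature cell that contains each query coordinate.
        iy = np.clip(np.floor(coords[..., 0] * h).astype(int), 0, h - 1)
        ix = np.clip(np.floor(coords[..., 1] * w).astype(int), 0, w - 1)
        nearest = np.moveaxis(feat[:, iy, ix], 0, -1)  # (H, W, C)
        # Offset from the feature-cell center, measured in feature-cell units.
        centers = np.stack([(iy + 0.5) / h, (ix + 0.5) / w], axis=-1)
        scale = np.array([h, w], dtype=float)
        rel = (coords - centers) * scale
        cell_feat = np.broadcast_to(np.asarray(cell, dtype=float) * scale, rel.shape)
        x = np.concatenate([nearest, rel, cell_feat], axis=-1)
        hidden = np.maximum(x @ self.w1, 0.0)  # ReLU
        return np.tanh(hidden @ self.w2)  # (H, W, 3), values in (-1, 1)


# A random feature map stands in for the decoder output D(z).
feat = np.random.default_rng(1).normal(size=(4, 8, 8))
renderer = ToyFieldRenderer(channels=4)

# The same field rendered at two different resolutions.
low = renderer.render(feat, make_coords(16, 16), cell=(1 / 16, 1 / 16))
high = renderer.render(feat, make_coords(64, 64), cell=(1 / 64, 1 / 64))

# Render only the top-left quarter at the 64x64 pixel size.
patch = renderer.render(
    feat,
    make_coords(32, 32, top=0.0, left=0.0, bottom=0.5, right=0.5),
    cell=(1 / 64, 1 / 64),
)
patch_matches_crop = np.allclose(patch, high[:32, :32])
```

Because this toy renderer evaluates each coordinate independently, `patch_matches_crop` holds exactly. A convolutional renderer such as CLIF also draws on neighboring features, so this equality should be read as an illustration of the idea rather than a statement about the authors' implementation.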
