Image Generation with a Sphere Encoder

Kaiyu Yue, Menglin Jia, Ji Hou, Tom Goldstein

TL;DR

The Sphere Encoder introduces a spherical latent space autoencoder that enables high-quality image generation in a single forward pass, with few-step refinements achieving competitive results against diffusion models at far lower inference cost. The encoder maps natural images onto a sphere, while the decoder reconstructs images from sphere points; training relies on reconstruction and latent-space consistency losses rather than explicit priors. It supports conditional generation via AdaLN and CFG, and enabling iterative encode-decode cycles further enhances fidelity. Across CIFAR-10, Animal-Faces, Oxford-Flowers, and ImageNet, the approach delivers strong qualitative and quantitative results, with notable advantages in speed and flexibility for editing and cross-domain manipulation. This work opens avenues for fast, controllable generation and potential extensions to text-to-image tasks.

Abstract

We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the Sphere Encoder approach yields performance competitive with state-of-the-art diffusion models, but at a small fraction of the inference cost. The project page is available at https://sphere-encoder.github.io.
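The sampling procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `radius` parameter, the use of L2 normalization as the projection onto the sphere, and the exact form of the few-step loop are all assumptions made for the example. The `encode` and `decode` arguments stand in for the trained networks.

```python
import math
import random


def sample_sphere(dim, radius=1.0, rng=random):
    """Draw a point uniformly on a sphere by normalizing a Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in g))
        if norm > 0.0:  # retry in the (practically impossible) all-zero case
            return [radius * c / norm for c in g]


def project(v, radius=1.0):
    """Rescale a nonzero vector so that it lies on the sphere (assumed projection)."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * c / norm for c in v]


def generate(encode, decode, dim, steps=1, radius=1.0, rng=random):
    """One-step generation, optionally refined by encode/decode loops.

    `encode` and `decode` are hypothetical callables standing in for the
    trained encoder and decoder; steps=1 is a single decoder pass.
    """
    z = sample_sphere(dim, radius, rng)
    x = decode(z)
    for _ in range(steps - 1):
        z = project(encode(x), radius)  # map the sample back onto the sphere
        x = decode(z)                   # decode the refined latent
    return x
```

With toy stand-ins such as `encode = lambda x: [2 * c for c in x]` and `decode = lambda z: list(z)`, calling `generate(encode, decode, dim=8, steps=3)` exercises the loop without any trained model.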

Paper Structure (25 sections, 13 equations, 19 figures, 16 tables, 1 algorithm)

Figures (19)

  • Figure 1: Selected images generated by the Sphere Encoder in one step for CIFAR-10 ($32\times32$) and Animal-Faces, two steps for Oxford-Flowers, and four steps for ImageNet ($256\times256$).
  • Figure 2: A sphere encoder $E$ maps the natural image distribution uniformly onto a global sphere $S$. The decoder $D$ then generates a realistic image by decoding a random point on the sphere.
  • Figure 3: Posterior hole problem in VAEs. Columns: (1) Input images; (2) Autoencoder reconstructions; (3) Samples from a standard Gaussian prior; and (4) Samples from the estimated Gaussian posterior on the Animal-Faces training set. Unlike the modern FLUX.1/2 [flux] and SD-VAE [sdxl] autoencoders, our sphere encoder produces realistic images by decoding random points sampled from the sphere.
  • Figure 4: Spherifying the latent with noise. Encoder $E$ maps image $\mathbf{x}$ to a latent, which $f$ projects to $\mathbf{v}$ on sphere $S$. During training, random Gaussian noise $\sigma \cdot \mathbf{e}$ is added to $\mathbf{v}$, where $\sigma$ is a jittered noise magnitude. Decoder $D$ reconstructs the image $\hat{\mathbf{x}}$ from the re-projected noisy latent $f(\mathbf{v} + \sigma \cdot \mathbf{e})$.
  • Figure 5: Uncurated CIFAR-10 conditional generation with different sampling steps and with/without CFG. Convincing images can be formed with a single forward pass, with reliability and gFID improving with up to 4 steps.
  • ...and 14 more figures
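
The noise-and-reproject step in the Figure 4 caption, $f(\mathbf{v} + \sigma \cdot \mathbf{e})$, can be sketched as below. This is an illustrative sketch under stated assumptions rather than the paper's code: $f$ is taken to be L2 normalization onto a sphere of a chosen `radius`, and the jittered $\sigma$ is drawn uniformly from $[0, \texttt{sigma\_max}]$. The names `project_to_sphere`, `noisy_spherify`, `sigma_max`, and `radius` are hypothetical.

```python
import math
import random


def project_to_sphere(v, radius=1.0):
    """Assumed form of f: rescale a nonzero vector to the given radius."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * c / norm for c in v]


def noisy_spherify(latent, sigma_max=0.5, radius=1.0, rng=random):
    """Return (v, f(v + sigma * e)) for an encoder output `latent`.

    sigma is jittered per call (assumed uniform in [0, sigma_max]) and
    e is standard Gaussian noise with the same dimension as the latent.
    """
    v = project_to_sphere(latent, radius)           # v = f(E(x))
    sigma = rng.uniform(0.0, sigma_max)             # jittered noise magnitude
    e = [rng.gauss(0.0, 1.0) for _ in v]            # Gaussian noise direction
    noisy = [vi + sigma * ei for vi, ei in zip(v, e)]
    return v, project_to_sphere(noisy, radius)      # re-projected noisy latent
```

During training, the decoder would receive the second returned vector and be asked to reconstruct the original image, so that it learns to produce plausible images from points in a neighborhood of each encoded latent.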