Image Generation with a Sphere Encoder

Kaiyu Yue; Menglin Jia; Ji Hou; Tom Goldstein

TL;DR

The Sphere Encoder is an autoencoder with a spherical latent space that generates high-quality images in a single forward pass; with a few refinement steps it is competitive with diffusion models at far lower inference cost. The encoder maps natural images onto a sphere, and the decoder reconstructs images from points on the sphere; training relies on reconstruction and latent-space consistency losses rather than an explicit prior. The model supports conditional generation via AdaLN and classifier-free guidance (CFG), and iterating encode-decode cycles further improves fidelity. Across CIFAR-10, Animal-Faces, Oxford-Flowers, and ImageNet, it delivers strong qualitative and quantitative results, with notable advantages in speed and in flexibility for editing and cross-domain manipulation. The work opens avenues for fast, controllable generation and possible extensions to text-to-image tasks.

Abstract

We introduce the Sphere Encoder, an efficient generative framework capable of producing images in a single forward pass and competing with many-step diffusion models using fewer than five steps. Our approach works by learning an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to the image space. Trained solely through image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. Our architecture naturally supports conditional generation, and looping the encoder/decoder a few times can further enhance image quality. Across several datasets, the sphere encoder approach yields performance competitive with state-of-the-art diffusion models, but at a small fraction of the inference cost. The project page is available at https://sphere-encoder.github.io.
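
The sampling procedure described in the abstract (decode a random point on the sphere, then optionally loop through the encoder and decoder) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `sample_on_sphere`, `project_to_sphere`, `generate`, the `radius` parameter, and the toy `toy_encode`/`toy_decode` stand-ins are hypothetical and are not the authors' code. The sketch also assumes the projection is plain L2 normalization and that each refinement step simply re-encodes the current image; the paper may differ on both points.

```python
import math
import random


def project_to_sphere(z, radius=1.0):
    """Rescale a vector so that it lies on a sphere of the given radius (assumed projection)."""
    norm = math.sqrt(sum(c * c for c in z))
    return [radius * c / norm for c in z]


def sample_on_sphere(dim, radius=1.0, rng=random):
    """Draw a point uniformly on the sphere by normalizing an isotropic Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return project_to_sphere(g, radius)


def generate(decode, encode, dim, steps=1, radius=1.0, rng=random):
    """Decode a random sphere point, then run (steps - 1) encode/decode refinement loops."""
    v = sample_on_sphere(dim, radius, rng)
    x = decode(v)
    for _ in range(steps - 1):
        v = project_to_sphere(encode(x), radius)  # map the image back onto the sphere
        x = decode(v)
    return x


# Hypothetical stand-ins for the trained networks, used only to exercise the loop.
def toy_decode(v):
    return [2.0 * c for c in v]


def toy_encode(x):
    return [0.5 * c for c in x]


image = generate(toy_decode, toy_encode, dim=8, steps=4, rng=random.Random(0))
```

With `steps=1` the function reduces to single-pass generation; larger values correspond to the few-step refinement the abstract mentions.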

Paper Structure (25 sections, 13 equations, 19 figures, 16 tables, 1 algorithm)

Introduction
Method
Spherical Latent Space
Spherifying with Noise
Training Objective
Model Architecture
Quantitative Experiments
Small Image Size
Large Image Size
Lower FID scores?
Qualitative Experiments
Image Editing
Main Ablations
Related Work
Conclusion
...and 10 more sections

Figures (19)

Figure 1: Selected images generated by the Sphere Encoder in one step for CIFAR-10 ($32\times32$) and Animal-Faces, two steps for Oxford-Flowers, and four steps for ImageNet ($256\times256$).
Figure 2: A sphere encoder $E$ maps the natural image distribution uniformly onto a global sphere $S$. The decoder $D$ then generates a realistic image by decoding a random point on the sphere.
Figure 3: Posterior hole problem in VAEs. Columns: (1) Input images; (2) Autoencoder reconstructions; (3) Samples from the standard Gaussian prior; and (4) Samples from the estimated Gaussian posterior on the Animal-Faces training set. Unlike modern FLUX.1/2 and SD-VAE, our sphere encoder produces realistic images by decoding random points sampled from the sphere.
Figure 4: Spherifying the latent with noise. Encoder $E$ maps image $\mathbf{x}$ to a latent, which $f$ projects to $\mathbf{v}$ on sphere $S$. During training, random Gaussian noise $\sigma \cdot \mathbf{e}$ is added to $\mathbf{v}$, where $\sigma$ is a jittered magnitude. Decoder $D$ reconstructs the image $\hat{\mathbf{x}}$ from the re-projected noisy latent $f(\mathbf{v} + \sigma \cdot \mathbf{e})$.
Figure 5: Uncurated CIFAR-10 conditional generation with different sampling steps and with/without CFG. Convincing images can be formed with a single forward pass, with reliability and gFID improving with up to 4 steps.
...and 14 more figures
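
The noise injection described in the Figure 4 caption, $f(\mathbf{v} + \sigma \cdot \mathbf{e})$, can be sketched in a few lines. This is an illustrative sketch only, under stated assumptions: the projection $f$ is taken to be L2 normalization onto a sphere of radius `radius`, and the jittered magnitude $\sigma$ is drawn uniformly from `[0, sigma_max]`. The caption specifies neither choice, and the names `project_to_sphere`, `noisy_latent`, `sigma_max`, and `radius` are hypothetical.

```python
import math
import random


def project_to_sphere(z, radius=1.0):
    """Assumed form of f: rescale z so it lies on a sphere of the given radius."""
    norm = math.sqrt(sum(c * c for c in z))
    return [radius * c / norm for c in z]


def noisy_latent(latent, sigma_max=0.5, radius=1.0, rng=random):
    """Return (v, f(v + sigma * e)) for a raw encoder output `latent`."""
    v = project_to_sphere(latent, radius)        # v = f(E(x))
    sigma = rng.uniform(0.0, sigma_max)          # jittered magnitude (assumed uniform)
    e = [rng.gauss(0.0, 1.0) for _ in v]         # Gaussian noise direction
    perturbed = [a + sigma * b for a, b in zip(v, e)]
    return v, project_to_sphere(perturbed, radius)  # re-project onto the sphere


# Example: a 2-D "latent" standing in for the encoder output E(x).
v, v_noisy = noisy_latent([3.0, 4.0], rng=random.Random(0))
```

During training, the decoder would receive `v_noisy` and be asked to reconstruct the original image; because of the re-projection, the decoder only ever sees inputs that lie on the sphere.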
