SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation

Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Aditya Grover, Jason Kuen

Abstract

Recent advances in discrete image generation have shown that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring a larger model and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
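
The contrast between a one-hot target and a neighbor-based soft target can be illustrated with a minimal sketch. The abstract does not give the exact form of the objective, so everything below is an assumption made for illustration: the function names `snce_soft_target` and `cross_entropy`, the proximity kernel exp(-||z - e_i||^2 / tau), the neighborhood size `k`, the temperature `tau`, and the toy 2D codebook. It is not the authors' implementation.

```python
import math

def snce_soft_target(z, codebook, k=3, tau=1.0):
    """Soft target over the k codes nearest to the continuous embedding z.

    Assumed kernel: weight_i proportional to exp(-||z - e_i||^2 / tau),
    restricted to the k nearest codes. The paper's exact kernel may differ.
    """
    # Squared Euclidean distance from z to every code embedding.
    d2 = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
    # Indices of the k nearest codes, closest first.
    nearest = sorted(range(len(codebook)), key=lambda i: d2[i])[:k]
    # Subtract the smallest distance for numerical stability.
    d_min = d2[nearest[0]]
    weights = {i: math.exp(-(d2[i] - d_min) / tau) for i in nearest}
    total = sum(weights.values())
    target = [0.0] * len(codebook)
    for i, w in weights.items():
        target[i] = w / total
    return target

def cross_entropy(target, probs, eps=1e-12):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)

# Hypothetical toy codebook: three nearby codes and one distant code.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
z = (0.2, 0.1)                   # continuous embedding; nearest code is index 0
one_hot = [1.0, 0.0, 0.0, 0.0]   # standard hard target
soft = snce_soft_target(z, codebook, k=3, tau=1.0)

# Two predictions with the same mass on the correct code but different spill-over.
pred_near = [0.5, 0.4, 0.05, 0.05]   # extra mass on a neighboring code
pred_far = [0.5, 0.05, 0.05, 0.4]    # extra mass on the distant code

ce_near = cross_entropy(one_hot, pred_near)
ce_far = cross_entropy(one_hot, pred_far)
snce_near = cross_entropy(soft, pred_near)
snce_far = cross_entropy(soft, pred_far)
```

In this toy setting the one-hot loss is identical for both predictions, because it only looks at the probability of the single closest code, whereas the soft-target loss is lower for the prediction that places its remaining mass on a nearby code. Under the stated assumptions, this is the geometric signal that the soft target adds.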

Paper Structure (28 sections, 27 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Limitations of Vanilla Cross Entropy Loss with One-hot Target. Given an input image, vanilla CE loss cannot distinguish among non-closest tokens in the embedding space, even though some of these tokens are close to the ground truth in the embedding space and can decode to semantically similar images. Additional details of the loss computation in this figure can be found in the appendix.
  • Figure 2: Toy examples on 2D Gaussians. (a) Ground-truth distribution, (b) 100-sample dataset, (c) discretized dataset, (d) L2 regression results, (e) CE loss results, (f) SNCE loss results.
  • Figure 3: Qualitative results. (Left) Text-to-image generation. (Right) Image editing.
  • Figure 4: Additional qualitative comparisons.