Next Visual Granularity Generation

Yikai Wang, Zhouxia Wang, Zhonghua Wu, Qingyi Tao, Kang Liao, Chen Change Loy

Abstract

We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently outperforms the VAR series in terms of FID scores (3.30 $\rightarrow$ 3.03, 2.57 $\rightarrow$ 2.44, 2.09 $\rightarrow$ 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models are released at https://yikai-wang.github.io/nvg.

Paper Structure

This paper contains 50 sections, 10 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: We propose the Next Visual Granularity (NVG) generation framework, which represents images with a varying number of unique tokens, naturally forming different granularity levels. The induced structure maps reflect how these tokens are assigned across different spatial locations. The structure maps and unique tokens are iteratively generated to gradually refine the generated image.
  • Figure 2: The relationship between the image generated so far at each stage $\bm{x}_i$, the final generated image $\bm{x}$, and the content $\bm{c}_i$ and structure $\bm{s}_i$ of each stage in the visual granularity sequence.
  • Figure 3: We use a $K$-dim vector to encode the structure across all stages. At stage $0$, all locations belong to a single cluster, so we pad the vector with all $1$s. For stages $i > 0$, the embedding is inherited from the parent and extended with one extra bit ($0$ or $2$) to distinguish between child labels.
  • Figure 4: Left: Overview of our generation pipeline. At each stage, we first generate the structure and then generate the content based on that structure. Both steps are guided by the input text, the current canvas, and the current hierarchical structure. Right: Overview of how we obtain the target from the network predictions. (1): The structure generator predicts the overall structure embedding, from which we extract the channel for the next stage. (2): The content generator predicts the final canvas. We compute the residual between the predicted final canvas and the current canvas $\bm{x}_{i-1}$, and use it to obtain the next-stage content prediction.
  • Figure 5: Visualization of generated images. Top: We show several representative examples to illustrate the iterative generation process. Middle: The generated binary structure maps align well with the final images. Bottom: Our NVG-$d24$ model can generate diverse and high-quality images.
  • ...and 5 more figures
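
The structure encoding described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that every stage splits each cluster into at most two children (suggested by the "binary structure maps" of Figure 5), that stage $i$ writes to entry $i-1$ of the $K$-dim vector, and that child label 0 maps to value 0 and child label 1 maps to value 2. The function names `structure_embedding` and `cluster_ids` are hypothetical.

```python
import numpy as np

def structure_embedding(child_bits, K):
    """Build a K-dim structure vector for every spatial location.

    child_bits: integer array of shape (num_stages, H, W) with values in {0, 1};
        child_bits[i - 1] is the child label chosen at stage i (an assumed layout).
    Returns an array of shape (K, H, W) with values in {0, 1, 2}, where 1 marks
    entries that are still padding (no split recorded yet).
    """
    child_bits = np.asarray(child_bits, dtype=np.int64)
    num_stages, H, W = child_bits.shape
    if num_stages > K:
        raise ValueError("more stages than embedding dimensions")
    # Stage 0: a single cluster everywhere, so the vector is padded with all 1s.
    emb = np.ones((K, H, W), dtype=np.int64)
    # Stage i > 0: keep the parent's entries and fill one more entry with 0 or 2.
    emb[:num_stages] = 2 * child_bits
    return emb

def cluster_ids(emb, stage):
    """Recover an integer cluster label per location after `stage` splits."""
    ids = np.zeros(emb.shape[1:], dtype=np.int64)
    for k in range(stage):
        # Map the stored value {0, 2} back to a binary digit {0, 1}.
        ids = ids * 2 + emb[k] // 2
    return ids

# Toy 2x2 example with two stages and K = 3.
bits = np.array([[[0, 1], [0, 1]],   # stage 1: left column vs right column
                 [[0, 0], [1, 1]]])  # stage 2: top row vs bottom row
emb = structure_embedding(bits, K=3)
print(emb[:, 0, 1])           # structure vector of the top-right location
print(cluster_ids(emb, 2))    # four distinct clusters after two splits
```

Under these assumptions, two locations share a cluster at stage $i$ exactly when their first $i$ entries agree, so the number of distinct clusters can at most double per stage, and the untouched entries stay at the padding value 1.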