Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens

Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang

TL;DR

Layton tackles high-resolution image reconstruction and generation with a limited number of discrete tokens by aligning a discrete image tokenizer with the latent space of pre-trained Latent Diffusion Models (LDMs). It introduces Latent Diffusion Reconstruction (LADD) and a latent consistency decoder to enable faithful 1024×1024 reconstructions from only 256 tokens, reaching a reconstruction FID (rFID) of 10.8 on the MSCOCO-2017 5K benchmark, and it accelerates inference to 1-2 sampling steps via latent consistency models. Extending this framework, LaytonGen adopts an autoregressive transformer to predict token sequences conditioned on text, reaching a GenEval score of 0.73 and reporting strong MSCOCO results on text-to-image tasks. Across ImageNet and MSCOCO benchmarks, Layton and LaytonGen show that discrete tokenization can be combined effectively with powerful latent decoders, enabling efficient, high-fidelity, high-resolution multimodal synthesis at practical computational cost.

Abstract

Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods struggle to balance efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or sacrifices critical details through token reduction. To resolve this, we propose the Latent Consistency Tokenizer (Layton), which bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024×1024 images using only 256 tokens, a 16× compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Directly applying the LDM as the decoder results in color and brightness discrepancies. We therefore convert it into a latent consistency decoder, reducing multi-step sampling to 1-2 steps so that direct pixel-level supervision becomes possible. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with a reconstruction Fréchet Inception Distance (rFID) of 10.8 on the MSCOCO-2017 5K benchmark for 1024×1024 image reconstruction. We also extend Layton to an autoregressive text-to-image generation model, LaytonGen. It achieves a score of 0.73 on the GenEval benchmark, surpassing current state-of-the-art methods. Project homepage: https://github.com/OPPO-Mente-Lab/Layton
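The 16× figure in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the VQGAN baseline uses a 16× spatial downsampling factor (a common configuration, but not stated in the abstract), and the helper `grid_tokens` is a hypothetical name introduced here.

```python
def grid_tokens(image_size: int, downsample_factor: int) -> int:
    """Number of tokens in a square 2D token grid for a square image."""
    if image_size % downsample_factor != 0:
        raise ValueError("image_size must be divisible by downsample_factor")
    side = image_size // downsample_factor
    return side * side


IMAGE_SIZE = 1024      # 1024x1024 input, as in the abstract
LAYTON_TOKENS = 256    # token budget stated in the abstract
VQGAN_FACTOR = 16      # ASSUMPTION: VQGAN with 16x spatial downsampling

# A 16x-downsampling tokenizer yields a 64x64 grid on a 1024x1024 image.
vqgan_tokens = grid_tokens(IMAGE_SIZE, VQGAN_FACTOR)

# Ratio between the assumed VQGAN token count and Layton's 256 tokens.
compression = vqgan_tokens // LAYTON_TOKENS

# Average number of pixels each Layton token has to account for.
pixels_per_token = (IMAGE_SIZE * IMAGE_SIZE) // LAYTON_TOKENS

print(vqgan_tokens, compression, pixels_per_token)
```

Under that assumption the baseline needs 4096 tokens, which is 16 times Layton's 256, and each Layton token accounts for 4096 pixels on average (the area of a 64×64 patch).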

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Overview of Layton. The input image is processed sequentially by a downsampler, an encoder, and a quantizer to produce the condition features $C$. It also passes through the VAE encoder of the pretrained LDM to produce the latent $z_0$, which is diffused to produce $z_t$. LADD takes $C$, $z_t$, and $t$ as input. In the first phase, we apply $\mathcal{L}_{DF}$ to train LADD. In the second phase, we introduce acceleration models to LADD and replace $z_t$ with Gaussian noise to perform one- or two-step inference, which allows us to train LADD with the pixel reconstruction loss $\mathcal{L}_{PR}$. For simplicity, we omit the time step $t$. (b) Illustration of autoregressive text-to-image generation with LaytonGen.
  • Figure 2: Visual comparison of image reconstruction across methods. Layton achieves much better reconstruction results than VQGAN, TiTok, and LlamaGen, especially for faces. The images reconstructed by Layton-H* and Layton-T* show higher quality than those of the other methods, in some cases even surpassing the ground truth (GT).
  • Figure 3: Comparison of text-conditioned generation across methods. From left to right: (a) HART [tang2024hart], (b) SD1.5 [rombach2022high], (c) SDXL [podell2023sdxl], (d) SD3 [esser2024sd3], (e) Show-o [xie2024show], (f) LlamaGen [sun2024llamagen], (g) LaytonGen-H, (h) LaytonGen-H*, and (i) LaytonGen-T*. Apart from satisfactory visual quality, Layton also yields improved metrics compared to strong baselines.
  • Figure 4: More visual comparisons of image reconstruction across methods. From left to right: ground truth, VQGAN, TiTok, LlamaGen, Layton-H, Layton-H*, and Layton-T*.
  • Figure 5: More examples of text-conditioned generation across methods. From left to right: (a) HART, (b) SD1.5, (c) SDXL, (d) SD3, (e) Show-o, (f) LlamaGen, (g) LaytonGen-H, (h) LaytonGen-H*, and (i) LaytonGen-T*.
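
The quantizer step named in the Figure 1(a) caption can be illustrated with a generic nearest-neighbour codebook lookup. This is a minimal sketch of standard vector quantization, not the authors' implementation: the page does not say how Layton's quantizer selects codes, so the squared-Euclidean lookup, the toy 2-dimensional codebook, and the names `squared_distance` and `quantize` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Map each feature to its nearest codebook entry (ASSUMED lookup rule).

    Returns the discrete token ids and the corresponding code vectors,
    which play the role of the condition features C in this toy setup.
    """
    indices: List[int] = []
    for feature in features:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]),
        )
        indices.append(nearest)
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Toy data: three 2-D codes and three encoder outputs (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
features = [(0.9, 1.2), (-0.8, 1.7), (0.1, -0.2)]

token_ids, condition = quantize(features, codebook)
print(token_ids)  # [1, 2, 0]
```

In this toy setting the token ids are what an autoregressive model such as LaytonGen would predict, while the looked-up code vectors are what a decoder would be conditioned on.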