Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens
Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
TL;DR
Layton addresses high-resolution image reconstruction and generation under a small discrete-token budget by aligning a discrete image tokenizer with the latent space of pre-trained Latent Diffusion Models (LDMs). It introduces Latent Diffusion Reconstruction (LADD) and a latent consistency decoder, which together enable faithful 1024×1024 reconstructions from only 256 tokens, with an rFID of 10.8 on the MSCOCO-2017 5K benchmark. Because the decoder is a latent consistency model, inference needs only 1-2 sampling steps. Extending this framework, LaytonGen uses an autoregressive transformer to predict token sequences conditioned on text, reaching a GenEval score of 0.73 and state-of-the-art image-quality scores on MSCOCO for text-to-image generation. Across the ImageNet and MSCOCO benchmarks, Layton and LaytonGen show that discrete tokenization can be combined effectively with powerful latent decoders, giving efficient, high-fidelity, high-resolution multimodal synthesis at practical computational cost.
Abstract
Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods struggle to balance efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or loses critical details when the token count is reduced. To resolve this, we propose the Latent Consistency Tokenizer (Layton), which bridges discrete visual tokens and the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024×1024 images using only 256 tokens, a 16-fold compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Applying an LDM directly as the decoder results in color and brightness discrepancies. We therefore convert it into a latent consistency decoder, which reduces multi-step sampling to 1-2 steps and makes direct pixel-level supervision possible. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with a reconstruction Fréchet Inception Distance (rFID) of 10.8 on the MSCOCO-2017 5K benchmark for 1024×1024 image reconstruction. We also extend Layton to an autoregressive text-to-image generation model, LaytonGen. It achieves a score of 0.73 on the GenEval benchmark, surpassing current state-of-the-art methods. Project homepage: https://github.com/OPPO-Mente-Lab/Layton
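The 16-fold compression figure can be checked with simple arithmetic. The sketch below assumes the VQGAN baseline downsamples each spatial side by a factor of 16, a common configuration that the abstract does not state; the helper `token_count` and the constant names are hypothetical and used only for illustration.

```python
def token_count(image_size: int, downsample_factor: int) -> int:
    """Tokens needed for a square image laid out on a square token grid."""
    if image_size % downsample_factor != 0:
        raise ValueError("image_size must be divisible by downsample_factor")
    side = image_size // downsample_factor
    return side * side


IMAGE_SIZE = 1024     # from the abstract: 1024x1024 images
LAYTON_TOKENS = 256   # from the abstract: tokens per image
VQGAN_FACTOR = 16     # ASSUMPTION: per-side downsampling of the VQGAN baseline

# Under the assumed factor, VQGAN uses a 64x64 grid of tokens.
vqgan_tokens = token_count(IMAGE_SIZE, VQGAN_FACTOR)

# Ratio of baseline tokens to Layton tokens.
compression = vqgan_tokens / LAYTON_TOKENS

print(f"VQGAN tokens (assumed f={VQGAN_FACTOR}): {vqgan_tokens}")
print(f"Layton tokens: {LAYTON_TOKENS}")
print(f"Compression ratio: {compression:.0f}x")
```

Under that assumption the baseline needs 4096 tokens per image, so 256 tokens corresponds to the stated 16-fold reduction. If the 256 tokens were arranged as a square grid, that would be equivalent to a per-side reduction of 64, although the abstract does not say how the tokens are laid out.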
