Language-Guided Image Tokenization for Generation

Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu

TL;DR

TexTok introduces a text-conditioned image tokenization framework that uses image captions to guide the tokenizer and detokenizer, allocating learning capacity to fine-grained visual details within a compact latent space. By injecting caption embeddings via a frozen text encoder into ViT-based tokenizers and detokenizers, TexTok achieves substantial gains in reconstruction and generation quality across ImageNet resolutions while enabling major speedups by reducing the number of tokens required for generation. The method demonstrates state-of-the-art FID scores on ImageNet with competitive token budgets and supports effective text-to-image generation using captions with no extra annotation overhead. Overall, TexTok shows that language semantics can be leveraged at the tokenization stage to improve efficiency and fidelity in diffusion-based image generation.

Abstract

Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation. By conditioning the tokenization process on descriptive text captions, TexTok simplifies semantic learning, allowing more learning capacity and token space to be allocated to capture fine-grained visual details, leading to enhanced reconstruction quality and higher compression rates. Compared to the conventional tokenizer without text conditioning, TexTok achieves average reconstruction FID improvements of 29.2% and 48.1% on ImageNet-256 and -512 benchmarks respectively, across varying numbers of tokens. These tokenization improvements consistently translate to 16.3% and 34.3% average improvements in generation FID. By simply replacing the tokenizer in Diffusion Transformer (DiT) with TexTok, our system can achieve a 93.5x inference speedup while still outperforming the original DiT using only 32 tokens on ImageNet-512. TexTok with a vanilla DiT generator achieves state-of-the-art FID scores of 1.46 and 1.62 on ImageNet-256 and -512 respectively. Furthermore, we demonstrate TexTok's superiority on the text-to-image generation task, effectively utilizing the off-the-shelf text captions in tokenization. Project page is at: https://kaiwenzha.github.io/textok/.
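To make the token-budget claim concrete, here is a minimal arithmetic sketch. It assumes the standard DiT-XL/2 setup, which this section does not spell out: an 8x-downsampling SD-VAE latent followed by 2x2 patchification. The function name `dit_token_count` is hypothetical. The sketch only compares sequence lengths; it does not reproduce the measured 93.5x speedup, which is a wall-clock figure.

```python
def dit_token_count(resolution, vae_downsample=8, patch_size=2):
    """Sequence length of a latent DiT, assuming an 8x VAE and 2x2 patches."""
    side = resolution // (vae_downsample * patch_size)
    return side * side


# Token budget quoted in the abstract for ImageNet-512.
textok_tokens = 32

baseline_tokens = dit_token_count(512)
token_ratio = baseline_tokens / textok_tokens

# Self-attention compares every token with every other token, so the number
# of attention pairs shrinks with the square of the sequence-length ratio.
attention_pair_ratio = token_ratio ** 2

print(f"baseline tokens at 512x512: {baseline_tokens}")
print(f"sequence-length reduction:  {token_ratio:.0f}x")
print(f"attention-pair reduction:   {attention_pair_ratio:.0f}x")
```

The end-to-end speedup falls between these ratios because a transformer also has costs that scale linearly with sequence length, and because wall-clock inference includes fixed per-image work outside the generator.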

Paper Structure

This paper contains 35 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Reconstruction samples of TexTok compared with Baseline (w/o text) on ImageNet 256$\times$256 using different numbers of image tokens. TexTok enables the tokenizer to encode finer visual details into image tokens, achieving better reconstruction quality across various token counts, such as improved text in images, car wheels, and bird beaks. The improvement is particularly significant in the low-token regime. The yellow-boxed regions highlight the significant enhancements.
  • Figure 2: TexTok architecture. During training, a frozen text encoder (e.g., T5 [raffel2020exploring]) extracts text embeddings (tokens) from the given image caption. The image patches, learnable image tokens, and text tokens are fed into the tokenizer (a ViT [dosovitskiy2021an]) to produce the image tokens. During detokenization, the image tokens are concatenated with the same text tokens fed to the tokenizer and learnable patch tokens to reconstruct the image. For generation, only image tokens need to be generated.
  • Figure 3: Image reconstruction and generation performance comparison of TexTok with Baseline (w/o text) on ImageNet 256$\times$256 and 512$\times$512. TexTok consistently delivers significant improvements in image reconstruction and generation performance, with more pronounced gains as the number of tokens decreases. Class-conditional generation results are reported without classifier-free guidance (Baseline and TexTok use DiT-L as the generator, while SD-VAE uses DiT-XL/2). $^\dagger$: number taken from [li2024autoregressive].
  • Figure 4: Speed/performance tradeoff of TexTok + DiT-XL compared to the original DiT-XL/2 on ImageNet 256$\times$256 and 512$\times$512. TexTok achieves the same generation performance 14.3$\times$/93.5$\times$ faster, or gains 34.0%/46.7% FID improvements using similar inference time. As image resolution scales up, this improvement is more pronounced. Each curve is obtained by using different sampling steps (50, 75, 150, 250). The inference time includes latent token generation, T5 text embedding extraction (for TexTok), and detokenization, measured on a single TPUv5e chip with a batch size of 32.
  • Figure 5: Qualitative text-to-image generation results of TexTok compared with Baseline (w/o text) on ImageNet 256$\times$256. TexTok generates higher-quality images that better follow the prompts compared to Baseline (w/o text). It even captures some fine-grained visual details presented in the reference images. The first row shows reference images from the ImageNet validation set along with their captions. Both TexTok and Baseline (w/o text) use the same generation settings and are conditioned on the same captions.
  • ...and 5 more figures
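
The data flow described in the Figure 2 caption can be sketched at the level of sequence bookkeeping. This is a toy illustration, not the authors' implementation: tokens are plain labels, `vit_stub` is a hypothetical stand-in for a real ViT, the ordering of the concatenated sequence is an assumption, and the sizes (256 patches, 32 image tokens, 8 text tokens) are illustrative only.

```python
def vit_stub(sequence):
    """Hypothetical stand-in for a ViT: one output per input position."""
    return [("out", kind, index) for kind, index in sequence]


def tokenize(num_patches, num_image_tokens, num_text_tokens):
    """Concatenate patches, learnable image tokens, and text tokens."""
    patches = [("patch", i) for i in range(num_patches)]
    learnable = [("image_token", i) for i in range(num_image_tokens)]
    text = [("text", i) for i in range(num_text_tokens)]
    outputs = vit_stub(patches + learnable + text)
    # Keep only the outputs at the learnable image-token positions
    # (assumed ordering: patches first, then image tokens, then text).
    start = len(patches)
    return outputs[start:start + len(learnable)]


def detokenize(num_image_tokens, num_patches, num_text_tokens):
    """Concatenate image tokens, the same text tokens, and learnable patch tokens."""
    image = [("image_token", i) for i in range(num_image_tokens)]
    text = [("text", i) for i in range(num_text_tokens)]
    patch_queries = [("patch_query", i) for i in range(num_patches)]
    outputs = vit_stub(image + text + patch_queries)
    # Keep only the outputs at the learnable patch-token positions,
    # which would be mapped back to pixels in a real model.
    return outputs[len(image) + len(text):]


latent = tokenize(num_patches=256, num_image_tokens=32, num_text_tokens=8)
reconstruction = detokenize(num_image_tokens=32, num_patches=256, num_text_tokens=8)
print(len(latent), len(reconstruction))
```

The point of the sketch is that the caption's text tokens enter both stages as side information, while only the short image-token sequence has to be produced by the generator.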