Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen

TL;DR

This work presents TA-TiTok, a text-aware 1D image tokenizer, and MaskGen, open-data masked generative models for text-to-image synthesis. By enabling one-stage training, supporting both discrete and continuous 1D tokens, and incorporating CLIP-based text guidance at de-tokenization, TA-TiTok achieves strong semantic alignment with minimal overhead. MaskGen builds on TA-TiTok with a Diffusion Transformer, open-data training, and aesthetic-score conditioning to deliver competitive T2I performance against private-data baselines, including strong results on MJHQ-30K and COCO while maintaining high sampling efficiency. Together, these contributions democratize access to high-performance masked T2I models and provide a reproducible, open-data pathway for future research.

Abstract

Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.

Paper Structure (20 sections, 2 equations, 16 figures, 13 tables)

This paper contains 20 sections, 2 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Text-to-Image (T2I) Generation Results by MaskGen. MaskGen, powered by the proposed compact text-aware 1D tokenizer TA-TiTok, is an efficient masked generative model that achieves state-of-the-art performance on multiple T2I benchmarks using only open data. The open-data, open-weight MaskGen models are designed to promote broader access and democratize T2I masked generative models.
  • Figure 2: Overview of TA-TiTok (Text-Aware Transformer-based 1-Dimensional Tokenizer). (a) TA-TiTok introduces three key enhancements to TiTok [yu2024image]: First, an efficient one-stage training procedure replaces the need for a complex two-stage pipeline. Second, TA-TiTok supports 1D tokens in both discrete (VQ) and continuous (KL) formats. Third, it incorporates textual information (using CLIP’s text encoder) during de-tokenization to improve semantic alignment with text captions. (b) A comparison of reconstruction results shows that TA-TiTok achieves superior reconstruction quality over TiTok.
  • Figure 3: Overview of MaskGen. MaskGen is a family of text-to-image masked generative models that supports both discrete (VQ variant) and continuous (KL variant) token representations. For discrete tokens, MaskGen is trained with cross-entropy loss [chang2022maskgit], while for continuous tokens, it employs diffusion loss [li2024autoregressive]. The architecture concatenates text conditions with TA-TiTok's latent tokens (both masked and unmasked) and feeds them into Diffusion Transformer blocks [peebles2023scalable], with separate adaptive LayerNorms (adaLN), linear projections, and feedforward networks (FFN) for the text and image modalities, following MM-DiT [esser2024scaling]. Additionally, aesthetic scores are incorporated as conditioning signals via adaLN. To encode captions, MaskGen uses the CLIP text encoder [radford2021learning] instead of the more resource-intensive T5-XXL [raffel2020exploring], making it more accessible to research groups with limited computational resources.
  • Figure 4: Visualization of Latent Token Attention Map and Latent Code Swapping. The results are from the VQ variant of TA-TiTok with 32 tokens. Each latent token attends to a prominent semantic entity, and swapping its code changes the appearance of the entity that the token focuses on.
  • Figure 5: Prompts Used for Recaptioning. One of four prompts is used to recaption each image, where {original_caption} is replaced with the original image caption.
  • ...and 11 more figures
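
The discrete (VQ) training setup described in the Figure 3 caption can be illustrated with a small, dependency-free sketch: some latent token ids are masked, text conditions are concatenated with the masked and unmasked tokens, and a cross-entropy loss is computed. Everything named here is a hypothetical stand-in rather than the authors' implementation: `MASK_ID`, `CODEBOOK_SIZE`, the helper functions, and the choice to average the loss over masked positions only (the usual MaskGIT-style convention, assumed here). The transformer itself is replaced by placeholder logits.

```python
import math
import random

CODEBOOK_SIZE = 8   # hypothetical toy codebook size, not the paper's value
MASK_ID = -1        # hypothetical sentinel id marking a masked latent token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of token ids with MASK_ID.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


def build_input(text_conditions, masked_tokens):
    """Concatenate text conditions with masked and unmasked latent tokens.

    Each element is tagged with its modality so that a model could apply
    modality-specific layers (assumed here, not implemented).
    """
    return [("text", t) for t in text_conditions] + [
        ("image", z) for z in masked_tokens
    ]


def masked_cross_entropy(logits, targets, positions):
    """Mean cross-entropy over the masked positions only (assumption)."""
    total = 0.0
    for p in positions:
        row = logits[p]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[targets[p]]
    return total / len(positions)


rng = random.Random(0)
tokens = [rng.randrange(CODEBOOK_SIZE) for _ in range(32)]  # 32 latent tokens
masked, positions = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
sequence = build_input(["t0", "t1"], masked)  # two placeholder text conditions

# Placeholder for the model output: uniform logits at every token position.
uniform_logits = [[0.0] * CODEBOOK_SIZE for _ in tokens]
loss = masked_cross_entropy(uniform_logits, tokens, positions)
```

With uniform logits the loss equals log(CODEBOOK_SIZE), which is a convenient sanity check for an untrained model. The continuous (KL) variant would replace the cross-entropy term with a diffusion loss, which this sketch does not cover.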