Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev

TL;DR

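Pre-trained language models offer limited help for auto-regressive text-to-image generation, for two reasons. First, image tokens possess significantly different semantics from text tokens, so pre-trained language models model them no better than randomly initialized ones. Second, the text in image-text datasets is too simple compared to normal language model pre-training data, which causes catastrophic degradation of the language models' capability.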

Abstract

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models' capability.
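As a rough illustration of what adapting a language model to image tokens can involve, the sketch below appends randomly initialized rows to a stand-in "pre-trained" embedding table and builds one joint sequence of caption and image tokens. The toy sizes, the helper names `extend_embedding` and `build_sequence`, and the choice to shift image token ids past the text vocabulary are all assumptions made for this example, not details confirmed by the paper.

```python
import random


def extend_embedding(text_embedding, num_image_tokens, dim, seed=0):
    """Append randomly initialized rows for image tokens to an existing table."""
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(num_image_tokens)]
    # The original (pre-trained) rows are kept unchanged at the front.
    return text_embedding + new_rows


def build_sequence(text_ids, image_ids, text_vocab_size):
    """Concatenate caption tokens with image tokens shifted past the text vocabulary."""
    return list(text_ids) + [text_vocab_size + i for i in image_ids]


# Toy sizes (hypothetical; not the paper's settings).
TEXT_VOCAB, IMAGE_VOCAB, DIM = 10, 4, 3
pretrained = [[0.0] * DIM for _ in range(TEXT_VOCAB)]  # stand-in for pre-trained weights
table = extend_embedding(pretrained, IMAGE_VOCAB, DIM)
sequence = build_sequence([1, 5, 2], [0, 3, 3, 1], TEXT_VOCAB)
print(len(table), sequence)  # 14 [1, 5, 2, 10, 13, 13, 11]
```

Under this assumed layout, the output layer would be widened the same way, so the model can predict either a text token or an image token at each position.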

Paper Structure (12 sections, 7 figures, 1 table)


Figures (7)

  • Figure 1: Auto-regressive and diffusion-based models achieve similar performance on text-to-image generation. However, while all of the diffusion models leverage pre-trained language models, none of the auto-regressive models do.
  • Figure 2: Adapting language models for auto-regressive text-to-image generation. (Left) An image is fed into an image tokenizer (MoVQGAN; Zheng et al., 2022) and converted to a grid of discrete tokens, from which it can be reconstructed well. (Right) Because images are converted to tokens similar to text tokens, we can enable language models to generate images by adapting their embedding and output layers.
  • Figure 3: Pre-trained language models do not help auto-regressive text-to-image generation. Models are trained on the HQITP-134M image-caption dataset with 64 A100 80GB GPUs and a batch size of 1M tokens. EMA stands for Exponential Moving Average.
  • Figure 4: Loss breakdown on image and text tokens. Models are trained on the HQITP-134M image-caption dataset with 64 A100 80GB GPUs and a batch size of 1M tokens.
  • Figure 5: Examples of generated images. We achieve an FID of 12.21 on MS-COCO at the end of training.
  • ...and 2 more figures
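
The per-modality loss breakdown described in the Figure 4 caption can be sketched as below. This is a minimal example under stated assumptions: the helper `loss_breakdown`, the boolean mask marking image positions, and the toy loss values are hypothetical and are not taken from the paper.

```python
def loss_breakdown(token_losses, is_image):
    """Average per-token losses separately over text and image positions."""
    text = [loss for loss, img in zip(token_losses, is_image) if not img]
    image = [loss for loss, img in zip(token_losses, is_image) if img]

    def mean(values):
        # Return NaN when a modality has no tokens in the sequence.
        return sum(values) / len(values) if values else float("nan")

    return {"text": mean(text), "image": mean(image)}


# Toy per-token negative log-likelihoods (made-up numbers).
losses = [2.0, 4.0, 6.0, 5.0, 7.0]
mask = [False, False, True, True, True]  # True marks an image-token position
result = loss_breakdown(losses, mask)
print(result)  # {'text': 3.0, 'image': 6.0}
```

Tracking the two averages separately makes it possible to see whether a pre-trained initialization changes the loss on text tokens, on image tokens, or on both.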