VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren

TL;DR

This work presents VAR-CLIP, a text-to-image framework that extends Visual Auto-Regressive modeling with CLIP-conditioned guidance. It employs a two-stage training pipeline combining a multi-scale VQVAE tokenizer and a conditional VAR, with BLIP-2-generated captions enabling ImageNet-scale training. A key finding is that the initial CLIP tokens disproportionately influence embeddings, informing caption guidance strategies. Empirically, VAR-CLIP achieves high-fidelity, semantically aligned images with efficient inference, while acknowledging artifacts and alignment challenges that motivate future captioning and guidance refinements.

Abstract

VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust generalization. However, the original VAR model is constrained to class-conditioned synthesis and cannot use textual captions for guidance. In this paper, we introduce VAR-CLIP, a novel text-to-image model that integrates Visual Auto-Regressive techniques with the capabilities of CLIP. The VAR-CLIP framework encodes captions into text embeddings, which are then used as textual conditions for image generation. To facilitate training on extensive datasets such as ImageNet, we have constructed a substantial image-text dataset leveraging BLIP-2. Furthermore, we examine the significance of word positioning within CLIP for caption guidance. Extensive experiments confirm VAR-CLIP's proficiency in generating fantasy images with high fidelity, textual congruence, and aesthetic excellence. Our project page is https://github.com/daixiangzi/VAR-CLIP

Paper Structure (11 sections, 8 equations, 4 figures)

This paper contains 11 sections, 8 equations, 4 figures.

Figures (4)

  • Figure 1: An illustration of VAR-CLIP. For a given text prompt and image, VAR-CLIP generates text embeddings from a pre-trained CLIP model and visual embeddings from a VAR encoder. The text embedding serves as a condition to guide the generation of multi-scale tokens and the final image. The Visual Autoregressive Transformer (VAR) generates these multi-scale tokens through next-scale prediction. During training, we utilize BLIP-2 to obtain text captions.
  • Figure 2: Samples generated by the ImageNet-trained model from ten text captions resembling those produced by BLIP-2.
  • Figure 3: Failure cases. Our method can produce noticeable artifacts in the image.
  • Figure 4: CLIP position score. Words at different positions in a sentence have varying impacts on the weight of the sentence.
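
The coarse-to-fine generation described in the Figure 1 caption can be outlined in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the scale schedule `PATCH_NUMS`, the codebook size `VOCAB_SIZE`, the 512-dimensional placeholder embedding, and the helper `predict_scale` are all hypothetical and do not come from the text above. The predictor is a random stub, so the example shows only the control flow of next-scale prediction: one step per scale, with each step conditioned on the text embedding and on all coarser token maps.

```python
import random

# Assumed scale schedule: side length of the token map at each scale.
# These values are illustrative and are not given in the text above.
PATCH_NUMS = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
VOCAB_SIZE = 4096  # hypothetical VQVAE codebook size


def predict_scale(text_embedding, previous_maps, side, rng):
    """Stand-in for the conditional transformer.

    A real model would attend to the text embedding and to every coarser
    token map; this stub ignores both and draws random token ids so that
    the surrounding loop can run.
    """
    return [rng.randrange(VOCAB_SIZE) for _ in range(side * side)]


def generate_token_maps(text_embedding, patch_nums=PATCH_NUMS, seed=0):
    """Coarse-to-fine loop: one prediction step per scale, not per token."""
    rng = random.Random(seed)
    maps = []
    for side in patch_nums:
        maps.append(predict_scale(text_embedding, maps, side, rng))
    return maps


# Placeholder for a CLIP caption embedding (dimension is an assumption).
text_embedding = [0.0] * 512
token_maps = generate_token_maps(text_embedding)
steps = len(token_maps)
total_tokens = sum(len(m) for m in token_maps)
print(steps, total_tokens)
```

Under the assumed schedule, the loop runs 10 prediction steps to produce 680 tokens in total, whereas next-token prediction would need one step per token. In a full system the final token maps would be decoded to an image by the multi-scale VQVAE decoder, which is omitted here.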