Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Baoquan Zhang, Huaibin Wang, Luo Chuyao, Xutao Li, Liang Guotao, Yunming Ye, Xiaochen Qi, Yao He

TL;DR

This paper proposes VQCT, a novel codebook transfer framework with part-of-speech that transfers a well-trained codebook from pretrained language models to VQIM for robust codebook learning, using the pretrained codebook and part-of-speech knowledge as priors.

Abstract

Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis that aims to represent an image as a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch, in a code-independent manner, to quantize continuous representations into discrete tokens. However, learning a codebook this way is highly challenging and may be a key cause of codebook collapse: because neither the relationships between codes nor good codebook priors are taken into account, some code vectors are rarely optimized and eventually die off. In this paper, inspired by pretrained language models, we find that these models have in effect already pretrained a superior codebook on large text corpora, but such information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which transfers a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook from these priors to achieve codebook transfer. Finally, we design a novel codebook transfer network that exploits the abundant semantic relationships between codes in the pretrained codebook for robust VQIM codebook learning. Experimental results on four datasets show that VQCT achieves superior VQIM performance over previous state-of-the-art methods.
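
To make the quantization step and the notion of codebook collapse concrete, the following is a minimal sketch in plain Python. It assumes nearest-neighbor assignment under squared Euclidean distance, which is the usual choice in VQ-VAE-style models but is not spelled out in the abstract. The helper names (`quantize`, `utilization`) and the toy values are illustrative assumptions, not the paper's code.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each continuous vector to the index of its nearest code."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


def utilization(indices: List[int], codebook_size: int) -> float:
    """Fraction of codes that were selected at least once."""
    return len(set(indices)) / codebook_size


# Toy data (assumed values): two codes sit near the features, two sit far away.
codebook = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (-10.0, 10.0)]
features = [(0.1, -0.2), (0.9, 1.2), (0.4, 0.3), (1.1, 0.8)]

tokens = quantize(features, codebook)
print(tokens)                              # discrete token sequence
print(utilization(tokens, len(codebook)))  # share of codes actually used
```

In this toy setting the two distant codes are never selected, so a code-independent update rule would never move them. That is the collapse behavior the abstract describes.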

Paper Structure (15 sections, 3 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Existing language models already provide a superior codebook, which contains abundant semantic relationships between codes and in which some vision-related tokens (adjectives and nouns) can be transferred well to VQIM. For example, in adjective space (a), the visually similar "orange" and "yellow" are closer than the dissimilar "yellow" and "blue"; in noun space (b), visually similar parts such as "breast" and "belly" are closer than the dissimilar "breast" and "wheel". By relying on such relationships, VQIM codebook collapse can be alleviated: for example, even when the "orange" code is not selected for optimization, its code vector can still be learned well through its close relationship to the "yellow" code.
  • Figure 2: Illustration of our codebook transfer framework with part-of-speech, i.e., VQCT, which consists of an encoder, a codebook transfer module, and a decoder. The encoder represents an image as a set of spatial continuous vectors. The codebook transfer module then generates a codebook by transfer from pretrained language models (PLM) to VQIM and quantizes the continuous vectors into a set of quantized vectors. Finally, the decoder reconstructs the original image from the quantized vectors.
  • Figure 3: Illustration of codebook optimization. We use a two-dimensional toy setting to show how the codebook distribution changes during optimization. In VQ-VAE (a), only the active "lucky" seeds (in peach) are optimized, while the other "dead" vectors (in red) are not optimized and remain fixed. In contrast, our VQCT updates all code vectors in the codebook, even though only one active code vector is selected for optimization, thanks to the abundant semantic relationships in the pretrained codebook.
  • Figure 4: Reconstructed images from different VQIM methods on four datasets. The red boxes highlight reconstruction details.
  • Figure 5: Visualization of codebook utilization on CUB-200.
  • ...and 3 more figures
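
The contrast described in the Figure 3 caption can be sketched in a two-dimensional toy setting. The code below is a hedged illustration only. It stands in a single shared 2x2 linear map `w` for the paper's codebook transfer network, whose actual architecture is not given in this section, and the frozen `priors` are made-up stand-ins for pretrained word embeddings. The point is only that when codes are generated from frozen priors through shared parameters, a gradient step driven by one selected code also moves the unselected codes.

```python
from typing import List, Tuple

Vec = Tuple[float, float]
Mat = Tuple[Vec, Vec]  # 2x2 matrix stored as two rows


def apply(w: Mat, e: Vec) -> Vec:
    """Generate a code vector from a frozen prior embedding: c = W e."""
    return (w[0][0] * e[0] + w[0][1] * e[1], w[1][0] * e[0] + w[1][1] * e[1])


def nearest(z: Vec, codes: List[Vec]) -> int:
    """Index of the code closest to z under squared Euclidean distance."""
    return min(range(len(codes)),
               key=lambda k: (codes[k][0] - z[0]) ** 2 + (codes[k][1] - z[1]) ** 2)


def independent_step(codes: List[Vec], z: Vec, lr: float) -> List[Vec]:
    """Code-independent update: only the selected code moves toward z."""
    k = nearest(z, codes)
    c = codes[k]
    updated = list(codes)
    updated[k] = (c[0] + lr * (z[0] - c[0]), c[1] + lr * (z[1] - c[1]))
    return updated


def transfer_step(w: Mat, priors: List[Vec], z: Vec, lr: float) -> Mat:
    """Shared-map update: gradient of 0.5 * ||W e_k - z||^2 w.r.t. W is (W e_k - z) e_k^T."""
    codes = [apply(w, e) for e in priors]
    k = nearest(z, codes)
    e = priors[k]
    err = (codes[k][0] - z[0], codes[k][1] - z[1])
    return (
        (w[0][0] - lr * err[0] * e[0], w[0][1] - lr * err[0] * e[1]),
        (w[1][0] - lr * err[1] * e[0], w[1][1] - lr * err[1] * e[1]),
    )


def count_moved(before: List[Vec], after: List[Vec]) -> int:
    """Number of code vectors that changed between two codebooks."""
    return sum(1 for b, a in zip(before, after) if b != a)


# Assumed toy priors: two nearby embeddings and one distant embedding.
priors = [(1.0, 0.0), (0.9, 0.3), (-1.0, 0.5)]
identity: Mat = ((1.0, 0.0), (0.0, 1.0))
z = (2.0, 0.0)  # one encoder output

start = [apply(identity, e) for e in priors]
after_independent = independent_step(start, z, lr=0.5)
new_w = transfer_step(identity, priors, z, lr=0.5)
after_transfer = [apply(new_w, e) for e in priors]

print(count_moved(start, after_independent))  # codes moved by the independent update
print(count_moved(start, after_transfer))     # codes moved by the shared-map update
```

With these toy values, the independent update changes only the selected code, while the shared-map update changes every code because all of them are functions of the same `w`. A linear map moves codes in a crude way; the sketch shows the shared-parameter mechanism, not the quality of the resulting codebook.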