Table of Contents
Fetching ...

LG-VQ: Language-Guided Codebook Learning

Guotao Liang, Baoquan Zhang, Yaowei Wang, Xutao Li, Yunming Ye, Huaibin Wang, Chuyao Luo, Kola Ye, linfeng Luo

TL;DR

This paper proposes a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks.

Abstract

Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an auto-regression manner. Although existing methods have shown superior performance, most methods prefer to learn a single-modal codebook (\emph{e.g.}, image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (\emph{e.g.}, text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (\emph{i.e.}, Semantic Alignment Module, and Relationship Alignment Module) to transfer such prior knowledge into codes for achieving codebook text alignment. In particular, our LG-VQ method is model-agnostic, which can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.

LG-VQ: Language-Guided Codebook Learning

TL;DR

This paper proposes a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks.

Abstract

Vector quantization (VQ) is a key technique in high-resolution and high-fidelity image synthesis, which aims to learn a codebook to encode an image with a sequence of discrete codes and then generate an image in an auto-regression manner. Although existing methods have shown superior performance, most methods prefer to learn a single-modal codebook (\emph{e.g.}, image), resulting in suboptimal performance when the codebook is applied to multi-modal downstream tasks (\emph{e.g.}, text-to-image, image captioning) due to the existence of modal gaps. In this paper, we propose a novel language-guided codebook learning framework, called LG-VQ, which aims to learn a codebook that can be aligned with the text to improve the performance of multi-modal downstream tasks. Specifically, we first introduce pre-trained text semantics as prior knowledge, then design two novel alignment modules (\emph{i.e.}, Semantic Alignment Module, and Relationship Alignment Module) to transfer such prior knowledge into codes for achieving codebook text alignment. In particular, our LG-VQ method is model-agnostic, which can be easily integrated into existing VQ models. Experimental results show that our method achieves superior performance on reconstruction and various multi-modal downstream tasks.
Paper Structure (23 sections, 9 equations, 19 figures, 13 tables)

This paper contains 23 sections, 9 equations, 19 figures, 13 tables.

Figures (19)

  • Figure 1: To answer the question, one not only needs to identify "women" and "racket" but also understand the semantic relationship between them ("holding").
  • Figure 2: The overall architecture of the proposed LG-VQ method. The right part of the figure is the basic VQ-VAE module, left is our language-guided module, which consists of three losses: global semantic alignment ($\mathcal{L}_{gsa}$), masked text prediction ($\mathcal{L}_{mtp}$), and relationship alignment supervision ($\mathcal{L}_{ras}$). Here, pre-trained text information guides discrete code tokens of an image to learning rich semantic knowledge based on three losses.
  • Figure 3: Illustration of relationship alignment module, we use $Z_{vt}$ to align with two words, then inject the semantic relationship of two words into $Z$ codes.
  • Figure 4: Examples of the top-1 most similar text selected on image-to-text retrieval task.
  • Figure 5: Examples of the top-1 word predicted on masked word prediction task.
  • ...and 14 more figures