SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook

Chenjing Ding; Chiyu Wang; Boshi Liu; Xi Guo; Weixuan Tang; Wei Wu

TL;DR

This work introduces SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. It constructs a temporospatially consistent semantic codebook, addressing codebook collapse and imbalanced token semantics.

Abstract

Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have used visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is their lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. In addition, imbalanced codebook distribution and codebook collapse can degrade performance because the codebook is used inefficiently. To address these challenges, we introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Using inference results from a segmentation model, our approach constructs a temporospatially consistent semantic codebook, addressing codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture image details and semantics simultaneously. As a result, SGC-VQGAN achieves state-of-the-art (SOTA) performance in both reconstruction quality and various downstream tasks. Because it requires no additional parameter learning, it can be applied directly to downstream tasks.
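To make the terms "codebook" and "codebook utilization" concrete, the following is a minimal sketch of plain nearest-neighbor vector quantization together with a measurement of the fraction of active tokens. It is an illustration under stated assumptions, not the SGC-VQGAN implementation: the function names `quantize` and `codebook_usage`, the two-dimensional toy codebook, and the toy features are all hypothetical, and the semantic clustering described in the abstract is not modeled here.

```python
from collections import Counter


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_usage(indices, codebook_size):
    """Return the fraction of codewords selected at least once, plus per-token counts."""
    counts = Counter(indices)
    return len(counts) / codebook_size, counts


# Hypothetical toy data: four codewords, but the features only lie near two of them.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (-3.0, 2.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.1), (1.1, 0.8)]

indices = quantize(features, codebook)
active_fraction, counts = codebook_usage(indices, len(codebook))
print(indices, active_fraction)
```

In this toy setting only two of the four codewords are ever selected, which is a small-scale picture of the inefficient codebook utilization that the abstract refers to.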

Paper Structure (22 sections, 12 equations, 9 figures, 7 tables)

Introduction
Method
Preliminary
VQ-VAE
Online Clustering
Semantic Online Clustering
Consistent Semantic Learning
Multi-level Feature Learning
Modeling Prior Distribution
Image Generation
Video Prediction
Experiments
Experimental Details
Degree of Semantization
Unconditional Image Generation
...and 7 more sections

Figures (9)

Figure 1: Semantic performance comparison. (a) t-SNE visualization for SGC-VQGAN (ours): our method successfully enhances codebook semantics. (b) Codebook usage of SGC-VQGAN (ours): our method achieves balanced and efficient codebook utilization with $100\%$ active tokens. (c) t-SNE visualization for VQGAN [vqgan]: most tokens are unused, and the remaining tokens cluster poorly. (d) Codebook usage of VQGAN: few tokens are active ($<10\%$). (e) t-SNE visualization for CVQ-VAE [zheng2023online]: most tokens are used, but they lack semantic significance. (f) Semantic performance: other methods lack specific semantic tokens, such as those representing humans and vehicles, which are crucial for generating real-world scenarios. Our method successfully covers these classes.
Figure 2: We introduce a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Using segmentation model inference results, our approach constructs a temporospatially consistent semantic codebook, addressing codebook collapse and imbalanced token semantics. A Pyramid Feature Learning pipeline integrates multi-level features to capture image details and semantics simultaneously. The method's simplicity allows direct application in downstream tasks.
Figure 3: PCA visualization of the codebook. Top: PCA visualization of our codebook. Bottom: the reference images.
Figure 4: PCA visualization of the codebook for comparison across different methods.
Figure 5: Video generation for complex scenarios. The numbers represent the video frame numbers.
...and 4 more figures
