SGC-VQGAN: Towards Complex Scene Representation via Semantic Guided Clustering Codebook

Chenjing Ding, Chiyu Wang, Boshi Liu, Xi Guo, Weixuan Tang, Wei Wu

TL;DR

This work introduces SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. It constructs a temporospatially consistent semantic codebook, addressing codebook collapse and imbalanced token semantics.

Abstract

Vector quantization (VQ) is a method for deterministically learning features through discrete codebook representations. Recent works have used visual tokenizers to discretize visual regions for self-supervised representation learning. However, a notable limitation of these tokenizers is a lack of semantics, as they are derived solely from the pretext task of reconstructing raw image pixels in an auto-encoder paradigm. Additionally, imbalanced codebook distribution and codebook collapse can hurt performance because the codebook is used inefficiently. To address these challenges, we introduce SGC-VQGAN, which uses a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Using inference results from a segmentation model, our approach constructs a temporospatially consistent semantic codebook, addressing codebook collapse and imbalanced token semantics. Our proposed Pyramid Feature Learning pipeline integrates multi-level features to capture both image details and semantics simultaneously. As a result, SGC-VQGAN achieves state-of-the-art (SOTA) performance in both reconstruction quality and various downstream tasks. Because it is simple and requires no additional parameter learning, it can be applied directly in downstream tasks, which gives it significant potential.
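To make the terms "vector quantization" and "codebook utilization" concrete, the following is a minimal sketch of generic nearest-neighbor quantization and a usage measurement. It is not the SGC-VQGAN method: the function names `quantize` and `codebook_usage`, the toy codebook, and the toy features are all assumptions introduced here for illustration.

```python
# Illustrative sketch only: generic VQ lookup, not the paper's implementation.
# All names and the toy data below are hypothetical.

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    under squared Euclidean distance."""
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_usage(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    return len(set(indices)) / codebook_size


# Toy codebook with four entries; the features cluster near the first two only.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]

indices = quantize(features, codebook)
usage = codebook_usage(indices, len(codebook))
print(indices, usage)  # prints: [0, 1, 0, 1] 0.5
```

In this toy case only half of the entries are ever selected. A usage fraction that stays low throughout training is the inefficient utilization that the abstract describes as codebook collapse.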

Paper Structure (22 sections, 12 equations, 9 figures, 7 tables)

This paper contains 22 sections, 12 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Semantic performance comparison. (a) t-SNE visualization for SGC-VQGAN (ours): our method successfully enhances codebook semantics. (b) Codebook usage of SGC-VQGAN (ours): our method achieves balanced and efficient codebook utilization with $100\%$ active tokens. (c) t-SNE visualization for VQGAN [vqgan]: most tokens are unused and the remaining tokens cluster poorly. (d) Codebook usage of VQGAN: few tokens are active ($<10\%$). (e) t-SNE visualization for CVQ-VAE [zheng2023online]: most tokens are used, but they lack semantic significance. (f) Semantic performance: other methods lack specific semantic tokens, such as those representing humans and vehicles, which are crucial for generating real-world scenarios. Our method successfully covers these classes.
  • Figure 2: We introduce a Semantic Online Clustering method to enhance token semantics through Consistent Semantic Learning. Using segmentation model inference results, our approach constructs a temporospatially consistent semantic codebook, addressing codebook collapse and imbalanced token semantics. Our Pyramid Feature Learning pipeline integrates multi-level features to capture both image details and semantics simultaneously. Its simplicity allows for direct application in downstream tasks, offering significant potential.
  • Figure 3: PCA visualization of the codebook. Top: PCA visualization of our codebook. Bottom: the reference images.
  • Figure 4: PCA visualization of the codebook for comparison across different methods.
  • Figure 5: Video generation for complex scenarios. The numbers represent the video frame numbers.
  • ...and 4 more figures