BRICS: Bi-level feature Representation of Image CollectionS

Dingdong Yang, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang

TL;DR

BRICS introduces a bi-level image representation that encodes images into continuous key codes and retrieves features from multi-resolution grids, enabling diffusion modeling directly on the key-code space. The approach jointly trains the encoder, feature grids, and decoder, leveraging relaxed data precision and strictly bounded key-code variance to improve diffusion training and reconstruction efficiency. Empirical results show competitive reconstruction with a compact decoder and state-of-the-art generation on FFHQ and LSUN-Church, surpassing several VQ-based and latent-diffusion baselines in CLIP-FID and related metrics. BRICS thus provides a scalable, efficient framework for dataset-scale image representation and diffusion-based synthesis, with potential extensions to other data types and modalities.

Abstract

We present BRICS, a bi-level feature representation for image collections, which consists of a key code space on top of a feature grid space. Specifically, our representation is learned by an autoencoder that encodes images into continuous key codes, which are then used to retrieve features from groups of multi-resolution feature grids. Because the key codes and feature grids are continuous and trained jointly with well-defined gradient flows, the feature grids reach high usage rates and generative modeling improves compared to discrete Vector Quantization (VQ). Unlike existing continuous representations such as KL-regularized latent codes, our key codes are strictly bounded in scale and variance. Overall, feature encoding by BRICS is compact, efficient to train, and enables generative modeling over key codes with a diffusion model. Experimental results show that our method achieves reconstruction quality comparable to VQ with a smaller and more efficient decoder network (50% fewer GFLOPs). By applying a diffusion model over our key code space, we achieve state-of-the-art image synthesis performance on FFHQ and on LSUN-Church (CLIP-FID 29% lower than LDM, 32% lower than StyleGAN2, and 44% lower than Projected GAN).
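
To make the retrieval step concrete, the following is a minimal sketch of how bounded, continuous key codes could index multi-resolution feature grids. It is an illustration under stated assumptions, not the authors' implementation: the 1D grids, the linear interpolation, the sigmoid used to bound the keys, the grid resolutions, and the name `retrieve_features` are all hypothetical choices made here for clarity.

```python
import numpy as np

def retrieve_features(key_codes, grids):
    """Look up features for continuous key codes (hypothetical helper).

    key_codes: array of shape (n_keys,), assumed to lie in [0, 1].
    grids: list of arrays, each of shape (resolution, feat_dim).
    Returns an array of shape (n_keys, sum of feat_dim over grids).
    """
    feats = []
    for grid in grids:
        res = grid.shape[0]
        # Map each bounded key code to a continuous position on this grid.
        pos = np.clip(key_codes, 0.0, 1.0) * (res - 1)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, res - 1)
        w = (pos - lo)[:, None]
        # Linear interpolation between the two neighbouring grid entries;
        # the result varies continuously with the key code.
        feats.append((1.0 - w) * grid[lo] + w * grid[hi])
    # Concatenate the features retrieved at every resolution.
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
# Assumed setup: three grids at resolutions 8, 16, 32 with 4 channels each.
grids = [rng.normal(size=(r, 4)) for r in (8, 16, 32)]
# Assumed bounding: squash unbounded encoder outputs into (0, 1).
keys = 1.0 / (1.0 + np.exp(-rng.normal(size=5)))
features = retrieve_features(keys, grids)
```

Because the interpolation is piecewise linear, a lookup of this kind is differentiable with respect to both the grid entries and the key codes almost everywhere, which is one way to read the abstract's "well-defined gradient flows"; a discrete VQ lookup, by contrast, selects a single codebook entry.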

Paper Structure (34 sections, 9 equations, 23 figures, 10 tables)

This paper contains 34 sections, 9 equations, 23 figures, 10 tables.

Figures (23)

  • Figure 1: Image reconstruction results by BRICS (row 2) on the validation set of the LSUN-Church $256 \times 256$ [yu2015lsun] and FFHQ $256 \times 256$ [karras2019style] datasets, where we note the recovery of fine details such as text captions and watermarks. The bottom row shows images generated via the diffusion model trained on key codes from BRICS (zoom in to see more details).
  • Figure 2: Rather than directly encoding images into features, our method first projects images into key codes and then uses the key codes to retrieve features from groups of feature grids. The encoder, decoder, and feature grids are jointly learned via autoencoding.
  • Figure 3: Overall pipeline of our method in three parts: Encoding, Feature Retrieval and Decoding.
  • Figure 4: Comparisons to VQGAN [esser2021taming] demonstrate that our method consistently produces superior reconstruction quality. For best viewing, zoom in on the highlighted regions marked by the yellow squares.
  • Figure 5: Results generated by the diffusion model applied to KL-regularized key codes versus ours, using a cosine noise scheduler with the min-SNR weighting strategy. The KL-regularized results show irregular spot artifacts.
  • ...and 18 more figures