Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution

Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xingyu Zhou, Shuhang Gu

TL;DR

The paper tackles the inefficiencies of vector-quantized priors in generative super-resolution by introducing Texture Vector Quantization (TVQ), which concentrates discrete modeling on texture while disentangling structure, and Reconstruction Aware Prediction (RAP), which trains the index predictor using image-level reconstruction loss via a straight-through estimator. TVQ reduces the codebook complexity and quantization error, while RAP aligns predictor optimization with perceptual image quality rather than code-level accuracy alone. Together, TVQ&RAP achieve state-of-the-art perceptual SR results with lower computational cost across synthetic and real-world datasets, with extensive ablations validating each component. This approach offers a practical, efficient path for high-fidelity, texture-rich SR in real-world applications.

Abstract

Vector-quantization (VQ) based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with the nearest codebook items and train the index predictor with code-level supervision. Because visual signals are so rich, VQ encoding often incurs a large quantization error. Furthermore, training the predictor with code-level supervision cannot take the final reconstruction error into account, which results in sub-optimal prior modeling accuracy. In this paper, we address these two issues by proposing a Texture Vector-Quantization strategy and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the characteristics of the super-resolution task and introduces a codebook only to model the prior of the missing textures, while the reconstruction aware prediction strategy uses the straight-through estimator to train the index predictor directly with image-level supervision. Our proposed generative SR model (TVQ&RAP) delivers photo-realistic SR results at a small computational cost.
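
As a rough illustration of how a straight-through estimator can pass an image-level loss back to an index predictor, the following sketch may help. It is a toy under stated assumptions, not the paper's implementation: the decoder is assumed to be the identity, the "image" is a single two-element vector, and the names `ste_step`, `codebook`, and `target` are hypothetical. The forward pass uses the hard argmax selection; the backward pass pretends the output was the softmax-weighted mix of codebook items.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ste_step(logits, codebook, target, lr=0.5):
    """One gradient step on the logits using a straight-through estimator.

    Assumption: the decoder is the identity, so the selected code is the output.
    """
    probs = softmax(logits)
    # Forward: hard selection of a single codebook item.
    idx = max(range(len(logits)), key=lambda k: logits[k])
    z = codebook[idx]
    # Image-level (reconstruction) loss, here a squared error.
    loss = sum((zi - ti) ** 2 for zi, ti in zip(z, target))
    grad_z = [2.0 * (zi - ti) for zi, ti in zip(z, target)]
    # Backward: treat z as sum_k probs[k] * codebook[k] (the straight-through
    # surrogate), so dL/dprobs[k] is the dot product of grad_z and codebook[k].
    g = [sum(gz * c for gz, c in zip(grad_z, code)) for code in codebook]
    # Chain through the softmax Jacobian to obtain dL/dlogits.
    mean_g = sum(p * gk for p, gk in zip(probs, g))
    grad_logits = [p * (gk - mean_g) for p, gk in zip(probs, g)]
    new_logits = [l - lr * gl for l, gl in zip(logits, grad_logits)]
    return new_logits, idx, loss

# Toy usage: the predictor initially prefers code 0, but code 1 reconstructs
# the target exactly.
codebook = [[0.0, 0.0], [1.0, 1.0], [1.0, -1.0]]
target = [1.0, 1.0]
logits = [1.0, 0.0, 0.0]
history = []
for _ in range(10):
    logits, idx, loss = ste_step(logits, codebook, target)
    history.append((idx, loss))
```

In this toy, the gradient depends on how each candidate code would change the reconstruction, so codes are not all penalized equally as they would be under a purely code-level loss. Because the surrogate is a linearization, it can behave differently with other codebooks or step sizes.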

Paper Structure

This paper contains 28 sections, 9 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Vanilla VQ vs. Texture VQ. Vanilla VQ directly encodes the entire visual feature space, so a large codebook is required to capture complex combinations of structure and texture information. Our Texture VQ focuses on modeling the textures absent from LR inputs, thereby mitigating the difficulty of visual encoding for generative super-resolution. Notably, TVQ achieves significantly better reconstruction performance than the vanilla method across a range of codebook sizes. Experimental details can be found in Section 4.3.
  • Figure 2: (a) Code-level loss ignores the visual impact of the predicted codes and penalizes all non-ground-truth predictions equally. (b) Our reconstruction-aware training strategy guides the predictor according to the visual impact of different code predictions.
  • Figure 3: Overview of the proposed Texture Vector Quantization (TVQ) and Reconstruction Aware Prediction (RAP) strategies. (a) Texture Vector Quantization: we decompose the image into structure and texture components and use the codebook only to generate discrete texture features; removing the structure component can significantly reduce the complexity of the visual feature space, resulting in more accurate texture representation. (b) Reconstruction Aware Prediction: instead of training the predictor through indirect code-level supervision, we introduce image-level supervision that takes into account the reconstruction error caused by different predictions; the predictor is trained to select codebook items that generate high-quality image details.
  • Figure 4: Qualitative comparison between different methods on the ImageNet-Test dataset.
  • Figure 5: Qualitative comparison between different methods on two real-world datasets.
  • ...and 8 more figures
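
The contrast drawn in the Figure 1 and Figure 3(a) captions, quantizing the whole feature versus quantizing only the texture part, can be sketched with a small toy. Everything specific here is an assumption made for illustration: the decomposition (structure = patch mean, texture = residual), the function names, and the hand-picked two-entry codebooks are hypothetical and are not taken from the paper. The toy is deliberately set up so that the texture residuals are easy to code; it shows the mechanism and says nothing about the paper's measured results.

```python
def nearest_code(vec, codebook):
    """Return (index, code) of the entry closest to vec in squared L2 distance."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    idx = min(range(len(codebook)), key=lambda k: dist(codebook[k]))
    return idx, codebook[idx]

def split_structure_texture(patch):
    """Hypothetical decomposition: structure = patch mean, texture = residual."""
    mean = sum(patch) / len(patch)
    return [mean] * len(patch), [v - mean for v in patch]

def vanilla_vq(patch, codebook):
    """Quantize the whole patch, structure and texture together."""
    _, code = nearest_code(patch, codebook)
    return list(code)

def texture_vq(patch, texture_codebook):
    """Keep the structure as-is and quantize only the texture residual."""
    structure, texture = split_structure_texture(patch)
    _, code = nearest_code(texture, texture_codebook)
    return [s + c for s, c in zip(structure, code)]

def sq_error(a, b):
    """Squared L2 error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hand-picked toy data: three patches at different brightness levels that
# share the same two texture patterns.
patches = [[0.1, 0.3], [0.7, 0.5], [0.8, 1.0]]
full_codebook = [[0.0, 0.0], [1.0, 1.0]]          # must cover structure and texture jointly
texture_codebook = [[-0.1, 0.1], [0.1, -0.1]]     # only has to cover the residuals

vanilla_err = sum(sq_error(p, vanilla_vq(p, full_codebook)) for p in patches)
texture_err = sum(sq_error(p, texture_vq(p, texture_codebook)) for p in patches)
```

With the same number of codebook entries, the texture-only variant reconstructs these toy patches almost exactly, because the brightness level is carried by the structure path rather than by the codebook. In a super-resolution setting the structure would come from the LR input rather than from the ground-truth patch, which this sketch does not model.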