Texture Vector-Quantization and Reconstruction Aware Prediction for Generative Super-Resolution
Qifan Li, Jiale Zou, Jinhua Zhang, Wei Long, Xingyu Zhou, Shuhang Gu
TL;DR
The paper tackles the inefficiencies of vector-quantized priors in generative super-resolution by introducing Texture Vector Quantization (TVQ), which concentrates discrete modeling on texture while disentangling structure, and Reconstruction Aware Prediction (RAP), which trains the index predictor using image-level reconstruction loss via a straight-through estimator. TVQ reduces the codebook complexity and quantization error, while RAP aligns predictor optimization with perceptual image quality rather than code-level accuracy alone. Together, TVQ&RAP achieve state-of-the-art perceptual SR results with lower computational cost across synthetic and real-world datasets, with extensive ablations validating each component. This approach offers a practical, efficient path for high-fidelity, texture-rich SR in real-world applications.
Abstract
Vector-quantization (VQ) based models have recently demonstrated strong potential for visual prior modeling. However, existing VQ-based methods simply encode visual features with their nearest codebook items and train the index predictor with code-level supervision. Because visual signals are so rich, VQ encoding often leads to large quantization error. Furthermore, training the predictor with code-level supervision cannot take the final reconstruction error into account, resulting in sub-optimal prior modeling accuracy. In this paper, we address these two issues and propose a Texture Vector-Quantization strategy and a Reconstruction Aware Prediction strategy. The texture vector-quantization strategy leverages the characteristics of the super-resolution task and introduces a codebook only to model the prior of the missing textures. The reconstruction aware prediction strategy uses the straight-through estimator to train the index predictor directly with image-level supervision. Our proposed generative SR model (TVQ&RAP) delivers photo-realistic SR results at a small computational cost.
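The two mechanisms named in the abstract, nearest-codebook encoding and straight-through training of an index predictor, can be illustrated with a minimal toy sketch. It is not the authors' implementation. The function names (`nearest_code`, `softmax`, `ste_logit_grads`), the three-entry 2-D codebook, the identity "decoder", and the squared-error loss are all assumptions made for illustration; the paper's model uses learned networks and image-level losses that the abstract does not specify.

```python
import math

def nearest_code(feature, codebook):
    """Return (index, squared error) of the codebook entry closest to feature."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, dists[idx]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ste_logit_grads(logits, codebook, target):
    """Toy straight-through estimator for an index predictor.

    Forward pass: a hard argmax over the logits selects one code.
    Backward pass: the gradient is computed as if the output had been the
    softmax-weighted mixture of codes, so the reconstruction loss can
    reach the logits even though argmax itself has no useful gradient.
    Assumption: the decoder is the identity and the loss is squared error.
    """
    probs = softmax(logits)
    k = max(range(len(logits)), key=logits.__getitem__)
    z = codebook[k]  # hard selection used in the forward pass
    loss = sum((zi - ti) ** 2 for zi, ti in zip(z, target))
    dl_dz = [2.0 * (zi - ti) for zi, ti in zip(z, target)]
    # Straight-through step: pretend z = sum_j probs[j] * codebook[j].
    dl_dp = [sum(g * c for g, c in zip(dl_dz, code)) for code in codebook]
    # Softmax Jacobian: dL/dlogit_i = p_i * (dL/dp_i - sum_j p_j * dL/dp_j).
    avg = sum(p * d for p, d in zip(probs, dl_dp))
    grads = [p * (d - avg) for p, d in zip(probs, dl_dp)]
    return k, loss, grads

# Hypothetical toy data: three 2-D codes and a predictor that prefers code 0.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
idx, err = nearest_code([0.9, 1.2], codebook)
k, loss, grads = ste_logit_grads([2.0, 0.5, 0.1], codebook, target=[1.0, 1.0])
```

In this toy setting the predictor initially picks code 0, but the reconstruction loss produces a negative gradient only on the logit of code 1, the entry that matches the target, so a gradient-descent step raises that logit. Code-level supervision such as cross-entropy against a ground-truth index would instead treat all wrong indices alike, regardless of how much each one hurts the reconstruction.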
