GVTNet: Graph Vision Transformer For Face Super-Resolution

Chao Yang, Yong Fan, Cheng Lu, Minghao Yuan, Zhijing Yang

TL;DR

GVTNet tackles face super-resolution by rethinking ViT patch processing as a graph problem: patches are graph nodes connected by a Minkowski-distance-based adjacency $A$ with entries $A_{ij}=1$ if $D_{min}(\boldsymbol{z_i}, \boldsymbol{z_j}) > T$, where $D_{min}(\boldsymbol{z_i}, \boldsymbol{z_j}) = \Big(\sum_{k=1}^{n} |z_{i,k}-z_{j,k}|^{p}\Big)^{1/p}$; this adjacency masks attention so that only neighboring patches influence each patch. The architecture uses deep GVT groups with an Adjacency Update Module and a Dual Modeling Block (DMB) that alternates Graph Vision Transformer and Swin Transformer layers to balance locality and global context. Empirical results on CelebA and HELEN show that GVTNet achieves higher PSNR/SSIM and better visual fidelity for facial components, with ablations confirming the contributions of the adjacency scheme and the dual-aggregation strategy, while maintaining a compact model. Overall, the work introduces a graph-informed transformer paradigm for face SR, offering improved detail recovery and efficient performance suitable for real-world applications.
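The adjacency rule and attention masking described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `minkowski_adjacency` and `masked_attention` are hypothetical, attention is single-head with no learned projections or windowing, and self-loops are added as an assumption so that every patch keeps at least one unmasked entry (under the stated rule, a patch's distance to itself is 0 and would otherwise be masked for any non-negative $T$).

```python
import numpy as np


def minkowski_adjacency(z, p=2.0, threshold=1.0, add_self_loops=True):
    """Binary adjacency from pairwise Minkowski distances.

    z: (n_patches, dim) array of patch embeddings.
    Follows the rule quoted in the TL;DR: A_ij = 1 if D(z_i, z_j) > T.
    """
    diff = np.abs(z[:, None, :] - z[None, :, :])   # (n, n, dim) pairwise |z_i - z_j|
    dist = (diff ** p).sum(axis=-1) ** (1.0 / p)   # (n, n) Minkowski distance of order p
    adj = (dist > threshold).astype(np.float64)
    if add_self_loops:
        # Assumption (not from the source): keep the diagonal so no row is fully masked.
        np.fill_diagonal(adj, 1.0)
    return adj


def masked_attention(q, k, v, adj):
    """Scaled dot-product attention restricted to entries where adj == 1."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(adj > 0, scores, -np.inf)          # mask non-neighbors
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)                              # exp(-inf) == 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage: three 2-D "patch embeddings", Euclidean case (p = 2), threshold T = 2.
z = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
adj = minkowski_adjacency(z, p=2.0, threshold=2.0)
out, weights = masked_attention(z, z, z, adj)
```

Each row of `weights` sums to 1 and is exactly zero wherever `adj` is zero, so a patch's output depends only on the patches it is connected to. With `add_self_loops=False`, a row with no connections would produce NaNs in the softmax, which is why the sketch keeps the diagonal.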

Abstract

Recent advances in face super-resolution research have utilized the Transformer architecture, which processes the input image as a sequence of small patches. However, because different facial components in a face image are strongly correlated, existing algorithms do not handle the relationships between patches well when super-resolving low-resolution images, which results in distorted facial components in the output. To solve this problem, we propose a transformer architecture based on graph neural networks, called the graph vision transformer network (GVTNet). We treat each patch as a graph node and establish an adjacency matrix based on the information between patches. In this way, each patch interacts only with its neighboring patches, which further models the relationships among facial components. Quantitative and visualization experiments demonstrate the superiority of our algorithm over state-of-the-art techniques. Detailed comparisons show that our algorithm has stronger super-resolution capabilities, particularly in enhancing facial components. The PyTorch code is available at https://github.com/continueyang/GVTNet

Paper Structure

This paper contains 9 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The left image shows ViT's sequential structure, which converts a 2D image into a sequence of patches. On the right, our proposed structure links patches as graph nodes.
  • Figure 2: The overall structure of our proposed algorithm. The left side is the overall framework of GVTNet, and the right side is the internal structure of the DMB module.
  • Figure 3: The attention mechanism of our proposed GVTNet compared with the traditional attention mechanism in SwinIR. The left side is the traditional attention, and the right side is our proposed G-WSA.
  • Figure 4: Visual analysis results of LAM. The DI value represents the range of information used by the model to recover the target.
  • Figure 5: Test results visualized at ×8 super-resolution on the CelebA test set.
  • ...and 1 more figure