LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution

Chenyang Wang, Wenjie An, Kui Jiang, Xianming Liu, Junjun Jiang

TL;DR

This work proposes LLV-FSR, a framework that combines a large vision-language model and higher-order visual priors for face super-resolution (FSR). A pre-trained vision-language model generates pluralistic priors (image caption, description, face semantic mask, and depth) that guide the reconstruction.

Abstract

Existing face super-resolution (FSR) methods have made significant advances, but they primarily super-resolve faces from limited visual information, in particular the original pixel-wise space, and commonly overlook pluralistic clues such as higher-order depth and semantics, as well as non-visual inputs (text caption and description). Consequently, these methods struggle to produce a unified and meaningful representation from the input face. We suppose that introducing a pluralistic language-vision representation into an unexplored potential embedding space could enhance FSR by encoding and exploiting the complementarity across language-vision priors. This motivates us to propose a new framework called LLV-FSR, which marries the power of a large vision-language model and higher-order visual priors with the challenging task of FSR. Specifically, besides directly absorbing knowledge from the original input, we introduce a pre-trained vision-language model to generate pluralistic priors, comprising the image caption, description, face semantic mask, and depth. These priors are then employed to guide the more critical feature representation, facilitating realistic and high-quality face super-resolution. Experimental results demonstrate that our proposed framework significantly improves both reconstruction quality and perceptual quality, surpassing the SOTA by 0.43 dB in terms of PSNR on the MMCelebA-HQ dataset.
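
To put the reported 0.43 dB PSNR gain in context, the following is a minimal sketch of the standard PSNR definition, together with the mean-squared-error ratio implied by a given gain. It is not the authors' evaluation code: the function names `psnr` and `mse_ratio_for_gain`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions for illustration, and the abstract does not state the exact protocol (for example colour space or cropping).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    Assumes flat sequences of pixel intensities and an 8-bit peak by default
    (illustrative choices, not taken from the paper).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a PSNR gain of `gain_db` decibels."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    hr = [10, 20, 30, 40]
    sr = [12, 18, 33, 41]
    print(f"PSNR: {psnr(hr, sr):.2f} dB")
    print(f"MSE ratio for +0.43 dB: {mse_ratio_for_gain(0.43):.4f}")
```

Because PSNR is logarithmic in the MSE, a 0.43 dB gain at a fixed peak value corresponds to an MSE roughly 9% lower than the baseline.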

Paper Structure

This paper contains 15 sections, 2 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparison of FSR frameworks. The green part denotes existing FSR methods, which super-resolve faces from the original LR input, while the red part shows our LLV-FSR, which generates high-quality output with the language-vision prior.
  • Figure 2: Overview of the proposed framework. Our method first feeds the LR face image into the pretrained large-scale model to extract the language-vision prior and then exploits the prior information to improve face image quality.
  • Figure 3: Framework of language-vision prior fusion block. (a): LVPFB; (b): SegA and DepA; (c): CapA; (d): DesA.
  • Figure 4: Language-vision prior visualization. (a): HR; (b): LR; (c): Semantic mask; (d): Depth; (e): Caption; (f): Description.
  • Figure 5: $\times$8 FSR results of state-of-the-art methods on MMCelebA-HQ dataset. (a): LR; (b): SRCNN; (c): FSRNet; (d): DIC; (e): SISN; (f): FaceFormer; (g): SFMNet; (h): WFEN; (i): LLV-FSR; (j): HR.
  • ...and 4 more figures