Image and Video Quality Assessment using Prompt-Guided Latent Diffusion Models for Cross-Dataset Generalization

Shankhanil Mitra, Diptanu De, Shika Rao, Rajiv Soundararajan

TL;DR

The paper tackles the challenge of cross-dataset generalization for no-reference image and video quality assessment. It introduces GenzIQA and GenzVQA, which leverage prompt-guided latent diffusion models with learnable cross-attention between image/video representations and quality-aware textual prompts, plus a temporal quality modulator to handle motion in videos. The authors demonstrate superior cross-database performance across a broad suite of IQA/VQA datasets, perform extensive ablations to pinpoint the contributions of cross-attention, prompts, and pooling, and show practical inference times. The approach enables robust QA under diverse distortions and content types, with potential for scalable deployment and further improvements via edge-distillation and faster diffusion techniques.

Abstract

The design of image and video quality assessment (QA) algorithms is extremely important to benchmark and calibrate user experience in modern visual systems. A major drawback of state-of-the-art QA methods is their limited ability to generalize across diverse image and video datasets with reasonable distribution shifts. In this work, we leverage the denoising process of diffusion models for generalized image QA (IQA) and video QA (VQA) by understanding the degree of alignment between learnable quality-aware text prompts and images or video frames. In particular, we learn cross-attention maps from intermediate layers of the denoiser of latent diffusion models (LDMs) to capture quality-aware representations of images or video frames. Since applying text-to-image LDMs to every video frame is computationally expensive, we only estimate the quality of a frame-rate sub-sampled version of the original video. To compensate for the loss in motion information due to frame-rate sub-sampling, we propose a novel temporal quality modulator. Our extensive cross-database experiments across various user-generated, synthetic, low-light, frame-rate variation, ultra-high-definition, and streaming content-based databases show that our model can achieve superior generalization in both IQA and VQA.
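The alignment idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes each quality prompt is reduced to a single key vector (`key_pos` for a "good quality" prompt, `key_neg` for a "bad quality" prompt), and it assumes a hypothetical pooling rule in which the score is the mean attention mass that the visual queries place on the positive prompt. The function names, dimensions, and random inputs are all illustrative.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prompt_alignment_score(visual_queries, key_pos, key_neg):
    """Mean attention weight on the positive prompt (assumed pooling rule).

    visual_queries: list of query vectors, one per spatial position.
    key_pos, key_neg: one key vector per prompt (a simplification; a real
    text encoder would produce several token embeddings per prompt).
    """
    scale = 1.0 / math.sqrt(len(key_pos))  # scaled dot-product attention
    pos_weights = []
    for q in visual_queries:
        logits = [dot(q, key_pos) * scale, dot(q, key_neg) * scale]
        pos_weights.append(softmax(logits)[0])
    return sum(pos_weights) / len(pos_weights)


# Toy inputs standing in for intermediate UNet features and prompt embeddings.
random.seed(0)
dim = 8
key_pos = [random.gauss(0, 1) for _ in range(dim)]
key_neg = [random.gauss(0, 1) for _ in range(dim)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
score = prompt_alignment_score(queries, key_pos, key_neg)
```

Because the softmax is taken over the two prompts, the score lies strictly between 0 and 1, and it moves toward 1 as the visual features align more closely with the positive prompt than with the negative one.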

Paper Structure (35 sections, 11 equations, 7 figures, 21 tables)

Figures (7)

  • Figure 1: Given an input image $x$ or video frame $u_i$, a VQ-VAE encodes it into the latent $z_0$. The noisy latent output $z_t$ of the forward diffusion is fed to the denoising UNet $\epsilon_{\theta}(\cdot)$. At every cross-attention block in $\epsilon_{\theta}(\cdot)$, the intermediate visual representation is aligned with learnable text representations $\{\tau_\theta(y_+),\tau_\theta(y_-) \}$. The attention maps are then pooled for each cross-attention block $p$ to predict the block quality $q^p(x)$ or $q^p(u_i)$.
  • Figure 2: Framework of the Temporal Quality Modulator. $\{\varphi_p(z_t^i)\}$ is the visual query feature of the UNet at the $p^{\text{th}}$ cross-attention block for a time-step $t$ across all sub-sampled frames $i \in \{ 1,2, \cdots, T_s \}$. Slow-pathway and fast-pathway features $h_s(v_s)$ and $h_f(v)$ are extracted from the frozen slow and fast pathways of a pre-trained SlowFastNet. The temporal correction factor $\gamma_p$ is obtained by average pooling the visual-motion cross-attention maps $A_s^{(p)}$ and $A_f^{(p)}$ across the spatial dimension, then concatenating them and passing them through a single-layer neural network.
  • Figure 3: Generated images from zero-shot SDM. The first panel shows an image generated without noise infused into the image latent; the second and third panels show images generated with noise injected at timesteps $t = 95$ (low noise) and $t = 950$ (high noise), respectively, and subsequently denoised. Lower LPIPS scores correspond to better perceptual quality.
  • Figure 4: SRCC performance variation of GenzIQA trained on CLIVE and tested across four databases at different timesteps.
  • Figure 5: 2D t-SNE visualization of cross-attention features of GenzIQA trained on FLIVE and tested on (a) Gaussian blur, (b) white noise, and (c) JPEG-compressed images from LIVE-IQA.
  • ...and 2 more figures
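
The pooling steps named in the Figure 2 caption (spatial average pooling of the two attention maps, concatenation, single-layer network) can be sketched as follows. This is only an illustration under stated assumptions: each attention map is assumed to be stored as a list of spatial positions with one feature vector per position, the single layer is assumed to be a plain linear map with hypothetical parameters `weights` and `bias`, and any nonlinearity or the way $\gamma_p$ is later applied to the block quality is left out because the caption does not specify it.

```python
def spatial_average(attn_map):
    """Average a map stored as [position][channel] over its positions."""
    n = len(attn_map)
    channels = len(attn_map[0])
    return [sum(row[c] for row in attn_map) / n for c in range(channels)]


def temporal_correction(attn_slow, attn_fast, weights, bias):
    """Pool both maps, concatenate them, and apply one linear layer.

    attn_slow, attn_fast: stand-ins for A_s^{(p)} and A_f^{(p)}.
    weights, bias: hypothetical parameters of the single-layer network.
    """
    pooled = spatial_average(attn_slow) + spatial_average(attn_fast)  # concat
    if len(weights) != len(pooled):
        raise ValueError("weights must match the concatenated feature length")
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Toy maps: two spatial positions each, two slow channels and one fast channel.
attn_slow = [[1.0, 3.0], [3.0, 5.0]]
attn_fast = [[0.0], [2.0]]
gamma_p = temporal_correction(attn_slow, attn_fast, [1.0, 1.0, 1.0], 0.0)
```

In a trained model the layer parameters would be learned jointly with the prompts; the fixed values here only make the arithmetic easy to follow.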