NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment

Kylie Cancilla, Alexander Moore, Amar Saini, Carmen Carrano

TL;DR

The paper addresses video quality assessment (VQA) when neither clean references nor human opinion labels are available, proposing NovisVQ, a streaming, no-reference, opinion-unaware model. Trained on synthetically degraded DAVIS videos, NovisVQ uses a temporal, multi-scale ResNet encoder with LSTM modules and a lightweight MLP to predict the per-frame full-reference (FR) metrics LPIPS, PSNR, and SSIM directly from degraded video. Compared to an image-based baseline, NovisVQ leverages temporal context to generalize to unseen degradations and real-world motion blur, achieving strong correlations with ground-truth FR metrics on GOPRO data and correlating more strongly with those metrics than BRISQUE. This work demonstrates scalable, self-supervised VQA suitable for real-time video processing in vision systems, without requiring pristine references or human annotations. It highlights the critical role of temporal modeling in robust VQA and points to future work on downstream task integration and broader degradation modeling.

Abstract

Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on costly human opinion labels for training. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics than BRISQUE, a widely used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
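
The training setup described above pairs each degraded frame with a full-reference score computed against its clean source, so the reference is needed only when building labels, not at inference. The following is a minimal sketch of that labeling step for PSNR alone, using only the Python standard library. The additive Gaussian noise degradation, the flat-list frame representation, and the helper names `add_noise` and `make_training_pairs` are illustrative assumptions, not the paper's pipeline.

```python
import math
import random


def psnr(reference, degraded, max_value=255.0):
    """PSNR in dB between two frames given as flat lists of pixel values."""
    if len(reference) != len(degraded):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def add_noise(frame, sigma, rng):
    """Hypothetical degradation: additive Gaussian noise clipped to [0, 255]."""
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in frame]


def make_training_pairs(clean_frames, sigma, seed=0):
    """Return (degraded_frame, psnr_label) pairs; the clean frame is used
    only to compute the label and is not part of the model input."""
    rng = random.Random(seed)
    pairs = []
    for frame in clean_frames:
        degraded = add_noise(frame, sigma, rng)
        pairs.append((degraded, psnr(frame, degraded)))
    return pairs
```

A model trained on such pairs sees only the degraded frame and regresses the label. LPIPS and SSIM labels would be produced the same way with their own metric functions.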

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: NovisVQ architecture: multiscale ResNet encoder extracts features from degraded frames, LSTM captures temporal context, and MLP predicts LPIPS, PSNR, and SSIM scores.
  • Figure 2: Impact of synthetic augmentations on LPIPS, PSNR, and SSIM.
  • Figure 3: Validation loss curves for NovisIQ and NovisVQ across LPIPS, PSNR, SSIM, and total loss. The video-based model consistently converges, while the image-based model diverges, highlighting the necessity of streaming models for frame quality evaluation in video object detection systems.
  • Figure 4: Quality metric predictions on real-world motion blur. NovisVQ closely tracks ground truth across all metrics, while NovisIQ shows poor correlation, demonstrating temporal context enables better generalization from synthetic to real-world degradations.
  • Figure 5: Mean-centered correlation analysis comparing NovisVQ, NovisIQ, and BRISQUE predictions against ground truth full-reference metrics on real-world motion-blurred video. Each subplot shows mean-centered predictions for individual metrics (LPIPS, PSNR, SSIM) and a composite quality score (mean of all three metrics). NovisVQ (blue) demonstrates substantially stronger correlation with ground truth (black) compared to both NovisIQ (orange) and BRISQUE (green, shown on secondary axis). The composite metric plot provides an apples-to-apples comparison between our video-based approach and the image-based BRISQUE baseline, highlighting the value of temporal context for quality assessment. All metrics are smoothed with a 3-frame moving average and mean-centered for visualization. Correlation coefficients (r) are displayed in legends.
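
The post-processing named in the Figure 5 caption (a 3-frame moving average, mean-centering, and a correlation coefficient r) can be sketched as follows. This is a standard-library illustration under stated assumptions: the caption does not say how the window is handled at the sequence edges, so the shrinking-window behavior here is a guess, and r is taken to be the Pearson coefficient.

```python
import math


def moving_average(values, window=3):
    """Centered moving average. At the edges the window shrinks to the
    frames that exist (an assumption; the caption does not specify)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def mean_center(values):
    """Subtract the sequence mean so curves on different offsets overlay."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    xc, yc = mean_center(x), mean_center(y)
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))
    if den == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return num / den


# Hypothetical per-frame scores, for illustration only.
predicted = [0.30, 0.35, 0.50, 0.45, 0.60]
ground_truth = [0.28, 0.40, 0.48, 0.50, 0.62]
r = pearson_r(moving_average(predicted), moving_average(ground_truth))
```

Mean-centering changes how the curves are plotted but not the value of r, because Pearson correlation is unaffected by a constant offset. Smoothing does affect r, since it is applied before the correlation is computed.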