Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

Jian Sun, Mohammad H. Mahoor

TL;DR

SSL-V3, a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA, is proposed for video classification. It integrates VQA into the classification task while addressing the shortage of VQA labels in video datasets.

Abstract

Video quality significantly affects video classification. We observed this problem when classifying Mild Cognitive Impairment: classification worked well on clear videos but worse on blurred ones. This led us to realize that incorporating Video Quality Assessment (VQA) may improve video classification. This paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to achieve this goal. SSL-V3 leverages a Combined-SSL mechanism to integrate VQA into video classification and to address the shortage of VQA labels, which commonly occurs in video datasets and makes it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL uses the video quality score as a factor that directly tunes the feature map of the video classification. The score then acts as an intersection point that links VQA and classification, so that the supervised classification task tunes the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in I-CONECT (a healthcare dataset involving facial videos), verifying SSL-V3's effectiveness.
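The idea that the quality score acts as a factor on the classification features can be illustrated with a small sketch. The abstract does not specify the exact operation, so the multiplicative form used here, the function names (`tune_cls`, `predict`), and the toy linear layer are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def tune_cls(cls_features, vqs):
    """Scale classification features by a scalar quality score.

    ASSUMPTION: plain multiplication; the source only says the score
    is used "as a factor" to tune the feature map.
    """
    return [vqs * f for f in cls_features]


def predict(cls_features, vqs, weights):
    """Tune the features, apply a hypothetical linear layer, return class probabilities."""
    tuned = tune_cls(cls_features, vqs)
    logits = [sum(w * t for w, t in zip(row, tuned)) for row in weights]
    return softmax(logits)


if __name__ == "__main__":
    toy_weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical classifier weights
    features = [2.0, 0.0]                   # hypothetical CLS features
    print(predict(features, vqs=1.0, weights=toy_weights))
    print(predict(features, vqs=0.5, weights=toy_weights))
```

Under this assumed form, a lower quality score shrinks the logits and flattens the predicted distribution. Because the prediction depends on the score, a supervised classification loss would also produce gradients for whatever parameters generate the score, which is consistent with the abstract's description of classification tuning the VQA parameters.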

Paper Structure (32 sections, 6 equations, 9 figures, 11 tables)

Figures (9)

  • Figure 1: The structure of the newly proposed SSL-V3. In the upper branch, ViViT(FE) loads $\textbf{X}_{1}$ and extracts Spatio-Temporal features $f_{\text{S}_{1}}$. The VQA head and the classification (CLS) head then process $f_{\text{S}_{1}}$ to compute $\text{VQS}_{1}$ and $\text{CLS}_{1}$, respectively, which are the output features of each head. The VQA head contains a Sequence Score Regressor and a Video Score Regressor, while the CLS head is a Multi-branch Classifier. The Tune-CLS module updates $\text{CLS}_{1}$ using $\text{VQS}_{1}$ to produce the prediction score $\hat{\text{Y}}_{1}$. Simultaneously, the lower branch follows the same procedure to generate $f_{\text{S}_{2}}$, $\text{VQS}_{2}$, $\text{CLS}_{2}$, and $\hat{\text{Y}}_{2}$.
  • Figure 2: The structure of the Sequence Score Regressor. It has two channels. The first, the Sequence Feature Processor, drops $z_{cls1}$ and reshapes $f_{S_{1}}$ from [bs, $n_{t}$, d] to [bs, $n_{t}$, d, 1]. The other, the Sequence Weight Estimate Network, generates the weights $\tilde{w}_{f}$ for the corresponding $f_{S_{1}}$. Finally, a weighted sum yields $SQS_{1}$. $C_{i,j}$ is the $j^{th}$ cube from the $i^{th}$ sequence in a given clip.
  • Figure 3: The structure of the Video Score Regressor (VSR). It has three channels. The central one is $SQS_{1}$ itself, denoted by $s_{2}$. The upper one applies a 1-D convolution to $SQS_{1}$ and captures the temporal motion effect, $m_{j}$, of $SQS_{1,j}$. A Softmax operation then normalizes the weight vector $m$ and outputs $\hat{m}$. Multiplying $SQS_{1}$ by $\hat{m}$ returns the temporal motion effect score $s_{1}$. Following the same pattern, we get the normalized temporal hysteresis effect $\hat{h}$ and the temporal hysteresis effect score $s_{3}$. Finally, the Temporal Memory-Based Fusion Module generates $VQS_{1}$ by concatenating $s_{1}$, $s_{2}$, and $s_{3}$ and feeding them into a two-layer FC module.
  • Figure 4: Two sample frames from the I-CONECT dataset. In (a), the window of the interviewee is bigger than that of the interviewer because the interviewee was speaking. Conversely, in (b), the interviewer was talking, so her window was bigger than the interviewee's.
  • Figure 5: More subjects from the Hockey Fight Detection dataset. The upper row shows high-quality frames. The bottom row shows poor-quality frames.
  • ...and 4 more figures
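
The captions of Figures 2 and 3 describe two small computations: a weighted sum that turns sequence features into per-sequence scores, and a 1-D convolution followed by Softmax that reweights those scores into a temporal effect score. The sketch below illustrates both for a single clip, without a batch dimension. Everything not named in the captions is an assumption: the function names, the zero-padded odd-length kernel, the reduction of the final multiplication to a dot product, and the toy numbers. The two-layer FC fusion step is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_scores(features, weights):
    """Weighted sum over the feature dimension, one score per sequence.

    features and weights are both lists of n_t rows with d entries each.
    """
    return [
        sum(w * f for w, f in zip(w_row, f_row))
        for f_row, w_row in zip(features, weights)
    ]


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    ASSUMPTION: the kernel has odd length.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def temporal_effect_score(sqs, kernel):
    """Convolve the scores, normalize with softmax, and reweight the scores.

    ASSUMPTION: "multiplying" the scores by the normalized weights is
    taken to be a dot product that returns a single scalar.
    """
    m = conv1d_same(sqs, kernel)
    m_hat = softmax(m)
    return sum(s * w for s, w in zip(sqs, m_hat))


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]  # hypothetical [n_t, d] features
    wts = [[0.5, 0.5]] * 3                        # hypothetical estimated weights
    sqs = sequence_scores(feats, wts)
    print(sqs)
    print(temporal_effect_score(sqs, kernel=[0.0, 1.0, 0.0]))
```

Since the softmax weights are positive and sum to one, the resulting score in this sketch is a convex combination of the per-sequence scores and always lies between their minimum and maximum. An all-zero kernel gives uniform weights and therefore the plain mean.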