InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model
Fengbin Guan, Zihao Yu, Yiting Lu, Xin Li, Zhibo Chen
TL;DR
This work addresses efficient compressed video quality assessment by leveraging the semantic, texture, and temporal features captured by a large video foundation model (InternVideo2). It presents a dual-loss distillation framework that transfers representations from InternVideo2 to lightweight students and compares homologous and heterogeneous backbones; the loss is defined as $\mathcal{L}_{\text{total}} = \mathcal{L}_2^{\text{teacher}} + \mathcal{L}_2^{\text{student}} + \mathcal{L}_{\text{Smooth}L_1}$. Key contributions: (1) successful transfer of InternVideo2 representations to small ViT-based models, (2) homologous distillation that often matches or exceeds the teacher on BVI-HD and Waterloo4K, (3) a systematic backbone comparison. Findings: the proposed distillation achieves high performance with substantially reduced resource needs, enabling practical, high-quality compressed video quality assessment.
Abstract
Video quality assessment tasks rely heavily on the rich features required for video understanding, such as semantic information, texture, and temporal motion. The existing video foundation model, InternVideo2, has demonstrated strong potential in video understanding tasks due to its large parameter count and large-scale multimodal pretraining. Building on this, we explored the transferability of InternVideo2 to video quality assessment under compression scenarios. To design a lightweight model suitable for this task, we proposed a distillation method that equips the smaller model with rich compression quality priors. Additionally, we examined the performance of different backbones during the distillation process. The results showed that, compared to other methods, our lightweight model distilled from InternVideo2 achieved excellent performance in compressed video quality assessment.
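As a rough illustration of how the three-term objective quoted in the TL;DR (two L2 terms plus a Smooth L1 term) could be assembled, the following is a minimal sketch in plain Python. The summary does not state which quantities each term compares, so the pairing used here is an assumption: each L2 term is taken to be the mean squared error between a model's predicted quality scores and the subjective labels, and the Smooth L1 term is taken to compare student features against teacher features. The function names (`mse`, `smooth_l1`, `total_loss`) and the `beta` threshold are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as the L2 term (an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Smooth L1 (Huber-style) loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def total_loss(
    teacher_scores: Sequence[float],
    student_scores: Sequence[float],
    labels: Sequence[float],
    teacher_feats: Sequence[float],
    student_feats: Sequence[float],
) -> float:
    """Sum of the three terms; the pairing of inputs is assumed, not sourced."""
    l2_teacher = mse(teacher_scores, labels)          # teacher prediction vs. label
    l2_student = mse(student_scores, labels)          # student prediction vs. label
    distill = smooth_l1(student_feats, teacher_feats) # student features vs. teacher features
    return l2_teacher + l2_student + distill


if __name__ == "__main__":
    # Toy values only; they do not correspond to any reported experiment.
    print(total_loss([1.0, 2.0], [1.0, 3.0], [1.0, 2.0], [0.0, 0.0], [0.5, 2.0]))
```

In a real training setup the same structure would typically operate on tensors in a deep learning framework, with the features flattened or projected to a common dimension before the Smooth L1 comparison; that projection step is likewise an assumption here.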
