InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model

Fengbin Guan, Zihao Yu, Yiting Lu, Xin Li, Zhibo Chen

TL;DR

This work addresses efficient compressed video quality assessment (VQA) by leveraging the semantic, texture, and temporal features captured by a large video foundation model (InternVideo2). It presents a dual-loss distillation framework that transfers representations from InternVideo2 to lightweight students, comparing homologous and heterogeneous backbones; the loss is defined as $\mathcal{L}_{\text{total}} = \mathcal{L}_2^{\text{teacher}} + \mathcal{L}_2^{\text{student}} + \mathcal{L}_{\text{Smooth}L_1}$. Key contributions: (1) successful transfer of InternVideo2 representations to small ViT-based models, (2) homologous distillation that often matches or exceeds the teacher on BVI-HD and Waterloo4K, (3) a systematic backbone comparison. Findings: the proposed distillation achieves high performance with substantially reduced resource needs, enabling practical, high-quality compressed VQA.
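To make the three-term loss concrete, here is a minimal pure-Python sketch of how such a sum could be computed. This summary does not say what each term compares, so the pairing below is an assumption: the two L2 terms are taken as mean squared errors of the teacher and student quality scores against ground-truth labels, and the Smooth L1 term is taken between student and teacher scores. The function names and the `beta` threshold are hypothetical and not taken from the paper.

```python
def mse(pred, target):
    """Mean squared error, used here for both L2 terms."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smooth_l1(pred, target, beta=1.0):
    """Standard Smooth L1: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def total_loss(teacher_scores, student_scores, labels, beta=1.0):
    # ASSUMPTION: teacher and student L2 terms are both measured against labels.
    l2_teacher = mse(teacher_scores, labels)
    l2_student = mse(student_scores, labels)
    # ASSUMPTION: the Smooth L1 term pulls student scores toward teacher scores.
    distill = smooth_l1(student_scores, teacher_scores, beta)
    return l2_teacher + l2_student + distill


# Example: the teacher is exact, the student is off by 0.5 on one clip.
loss = total_loss(teacher_scores=[3.0], student_scores=[2.5], labels=[3.0])
```

Under these assumptions the example evaluates to 0 + 0.25 + 0.125 = 0.375. In a real training setup the same sum would be computed on tensors with a deep learning framework's built-in loss functions.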

Abstract

Video quality assessment tasks rely heavily on the rich features required for video understanding, such as semantic information, texture, and temporal motion. The existing video foundation model, InternVideo2, has demonstrated strong potential in video understanding tasks due to its large parameter size and large-scale multimodal pretraining data. Building on this, we explored the transferability of InternVideo2 to video quality assessment under compression scenarios. To design a lightweight model suitable for this task, we proposed a distillation method to equip the smaller model with rich compression quality priors. Additionally, we examined the performance of different backbones during the distillation process. The results showed that, compared to other methods, our lightweight model distilled from InternVideo2 achieved excellent performance in compressed video quality assessment.

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Illustration of the Model Structure and Distillation Process.