VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment

Ziheng Jia, Linhan Cao, Jinliang Han, Zicheng Zhang, Jiaying Qian, Jiarui Wang, Zijian Chen, Guangtao Zhai, Xiongkuo Min

TL;DR

This work addresses the limited transferability of existing VQualA LMMs by introducing VITAL, a vision-encoder-centered pre-training framework built on a machine-annotated VL dataset of 4.58M pairs spanning visual quality scoring and text generation. By freezing the LLM and concentrating pre-training on the vision encoder, the approach achieves strong zero-shot performance and enables efficient model zoo expansion, including post-training warm-up with as few as 4000 samples. PMOD-based scoring and a dynamic focal loss for generation, combined with multi-task objectives, yield robust quality interpretation and high-quality text generation without human labels. Experimental results show superior generalization on IQA/VQA tasks, strong transferability to diverse decoders, and competitive text-generation capabilities, laying the groundwork for deployment-friendly, foundation-level LMMs in VQualA.

Abstract

Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires achieving versatility, powerfulness, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously enhances the model's quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize an efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA.
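
As a quick sanity check on the warm-up claim, the sketch below compares the two figures quoted in the TL;DR (4.58M pre-training pairs, as few as 4000 warm-up samples) against the "less than 1/1000" bound stated in the abstract. The constant and function names are hypothetical and chosen only for illustration; nothing here comes from the authors' code.

```python
# Figures quoted in the TL;DR and abstract; the names are illustrative only.
PRETRAIN_PAIRS = 4_580_000   # "4.58M" machine-annotated VL pairs
WARMUP_SAMPLES = 4_000       # "as few as 4000" warm-up samples


def warmup_fraction(warmup: int, pretrain: int) -> float:
    """Return the warm-up set size as a fraction of the pre-training set."""
    if pretrain <= 0:
        raise ValueError("pretrain must be positive")
    return warmup / pretrain


fraction = warmup_fraction(WARMUP_SAMPLES, PRETRAIN_PAIRS)
print(f"warm-up fraction: {fraction:.4%}")
print("below 1/1000:", fraction < 1 / 1000)
```

Under these two quoted figures the ratio comes out just under one part in a thousand, so the TL;DR and the abstract agree with each other.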

Paper Structure

This paper contains 45 sections, 35 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Comprehensive dataset construction underpins the versatility and performance of VQualA LMMs, while an effective training strategy is also essential for improving transferability and scalability. Most existing works focus on a single visual modality or task and depend heavily on human annotation, limiting scalability and versatility. In addition, fine-tuning the LLM decoder often causes overfitting, reducing transferability. These limitations serve as our motivation.
  • Figure 2: Overall workflow of the dataset preparation and the pre-training process of the VITAL-Series.
  • Figure 3: Machine-executed data annotation and quality scrutiny for the quality interpreting subtask.
  • Figure 4: Plots of data scaling effects ((a) and (b)) and the focal loss ablation study ((c) and (d)).
  • Figure 6: Wordclouds of the VL pairs in the text generation task.
  • ...and 11 more figures