VITAL: Vision-Encoder-centered Pre-training for LMMs in Visual Quality Assessment

Ziheng Jia; Linhan Cao; Jinliang Han; Zicheng Zhang; Jiaying Qian; Jiarui Wang; Zijian Chen; Guangtao Zhai; Xiongkuo Min

TL;DR

This work addresses the limited transferability of existing visual quality assessment (VQualA) large multi-modal models (LMMs) by introducing VITAL, a vision-encoder-centered pre-training framework built on a large machine-annotated vision-language (VL) dataset of $4.58M$ pairs spanning visual quality scoring and text generation. By freezing the LLM and concentrating pre-training on the vision encoder, the approach achieves strong zero-shot performance and enables efficient model zoo expansion, including a post-training warm-up with as few as $4000$ samples. PMOD-based scoring and a dynamic focal loss for generation, combined with multi-task objectives, yield robust quality interpretation and high-quality text generation without human labels. Experimental results show superior generalization on IQA/VQA tasks, strong transferability to diverse decoders, and competitive text-generation capabilities, laying the groundwork for deployment-friendly foundation LMMs in VQualA.

Abstract

Developing a robust visual quality assessment (VQualA) large multi-modal model (LMM) requires versatility, strong performance, and transferability. However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types and limits their generalization capacity and transferability. To address this, we propose a vision-encoder-centered generative pre-training pipeline and develop the VITAL-Series LMMs. (1) We adopt a machine-executed annotation-scrutiny paradigm, constructing over 4.5M vision-language (VL) pairs, the largest VQualA training dataset to date. (2) We employ a multi-task training workflow that simultaneously improves the model's quantitative scoring precision and strengthens its capability for quality interpretation across both image and video modalities. (3) Building upon the vision encoder, we realize an efficient model zoo extension: the model zoo exhibits strong zero-shot performance, and each paired decoder requires only a swift warm-up using less than 1/1000 of the pre-training data to achieve performance comparable to the fully trained counterpart. Overall, our work lays a cornerstone for advancing toward the foundation LMM for VQualA.
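
As a quick consistency check on the warm-up budget, the figures quoted in the TL;DR ($4000$ warm-up samples, $4.58M$ pre-training pairs) can be compared against the "less than 1/1000" claim. This is only a back-of-the-envelope sketch; it assumes the $4000$ samples are counted against the full $4.58M$-pair set, which the text above does not state explicitly.

$$
\frac{4000}{4.58 \times 10^{6}} \approx 8.73 \times 10^{-4} < 10^{-3}
$$

Under that assumption the warm-up uses roughly $0.087\%$ of the pre-training data, which agrees with the stated bound.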
