Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

TL;DR

EvoQuality is a fully self-supervised framework that lets a vision-language model (VLM) improve its image quality assessment (IQA) capabilities without ground-truth labels. Each iteration combines an offline stage, in which pairwise majority voting over the model's own outputs produces pseudo-ranking labels, with an online stage, in which the model is updated via group relative policy optimization (GRPO) using a fidelity reward derived from those labels. Across seven IQA benchmarks, EvoQuality improves the base VLM's zero-shot PLCC by 31.8% and outperforms state-of-the-art supervised VLM-based IQA models on five of them. The work shows that self-consistency and ranking-based supervision are viable for perceptual tasks where labeled data are scarce or unavailable, with potential implications for scalable quality assessment in real-world applications.

Abstract

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.
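
As a minimal illustration of the pairwise majority voting described in the abstract, the sketch below aggregates K sampled preferences for one image pair into a pseudo-label. The function name, the "A"/"B" vote encoding, and the rule of discarding ties are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def pairwise_majority_vote(votes: Sequence[str]) -> Tuple[Optional[str], float]:
    """Aggregate K sampled preferences ("A" or "B") for one image pair.

    Returns (pseudo_label, agreement). The pseudo-label is None when the
    vote is tied or empty (an assumed convention: such pairs carry no
    usable consensus and could simply be skipped).
    """
    counts = Counter(v for v in votes if v in ("A", "B"))
    n_a, n_b = counts["A"], counts["B"]
    total = n_a + n_b
    if total == 0:
        return None, 0.0
    if n_a == n_b:
        return None, 0.5
    winner = "A" if n_a > n_b else "B"
    # Agreement is the fraction of valid votes that back the winner.
    return winner, max(n_a, n_b) / total


# Hypothetical example: K = 5 sampled responses for the pair (A, B).
label, agreement = pairwise_majority_vote(["A", "A", "B", "A", "B"])
print(label, agreement)
```

The agreement ratio is returned alongside the label so that a caller could, for example, filter out low-consensus pairs; whether EvoQuality applies such a filter is not stated in the text above.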

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Without any ground truths, EvoQuality enables Qwen2.5-VL-7B to self-evolve its IQA capabilities, achieving (a) substantial performance improvements over the baseline and (b) superior or competitive results compared to supervised VLM-based models across multiple IQA benchmarks.
  • Figure 2: System diagram of the proposed self-evolving IQA framework EvoQuality. Each iteration operates in two stages. In the offline stage, the VLM generates pseudo-ranking labels for unlabeled image pairs via pairwise majority voting. In the subsequent online stage, the VLM's policy is updated via GRPO (Shao et al., 2024) using a fidelity reward derived from these pseudo-labels. This iterative loop enables the VLM to self-evolve its understanding of image quality.
  • Figure 3: PLCC results of EvoQuality with varying $K$ candidate responses.
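
The online stage described in the Figure 2 caption can be sketched roughly as follows. The fidelity measure sqrt(p * p_hat) + sqrt((1 - p) * (1 - p_hat)) is a common choice in ranking-based IQA, and normalizing rewards within a group of sampled responses is the core idea of GRPO. Whether EvoQuality uses exactly these forms is an assumption, as are the Thurstone-style preference probability, the `sigma` parameter, and every function name below.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence


def preference_probability(score_a: float, score_b: float, sigma: float = 1.0) -> float:
    """Probability that image A beats image B given predicted quality scores.

    Assumes a Thurstone Case V model: Phi((a - b) / (sqrt(2) * sigma)).
    """
    z = (score_a - score_b) / (math.sqrt(2.0) * sigma)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fidelity_reward(p_hat: float, p: float) -> float:
    """Fidelity between predicted preference p_hat and pseudo-label p, in [0, 1]."""
    return math.sqrt(p * p_hat) + math.sqrt((1.0 - p) * (1.0 - p_hat))


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical example: three sampled score pairs for one image pair whose
# pseudo-label says "A is better" (p = 1.0).
sampled_scores = [(4.0, 2.0), (3.0, 3.0), (1.0, 4.0)]
rewards = [fidelity_reward(preference_probability(a, b), 1.0) for a, b in sampled_scores]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses whose scores agree with the pseudo-label receive a reward near 1 and a positive advantage, while responses that contradict it are pushed down, which is the mechanism the iterative loop relies on.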