R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin

TL;DR

Current VLMs are not sufficiently accurate in judging fine-grained CG quality, but descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image.

Abstract

Immersive Computer Graphics (CG) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. We then use a subset of the dataset to build several question-answer benchmarks based on the descriptions, in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.
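
The retrieval idea in the abstract (giving a VLM the descriptions of visually similar images) can be illustrated with a minimal sketch. This is not the authors' two-stream framework, whose details are not given here; it assumes a single stream of precomputed image embeddings ranked by cosine similarity. The function names, the toy two-dimensional embeddings, and the example descriptions are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_descriptions(query_emb, database, k=2):
    """Return the descriptions of the k entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda item: cosine(query_emb, item["embedding"]),
        reverse=True,
    )
    return [item["description"] for item in ranked[:k]]

def build_prompt(question, references):
    """Prepend the retrieved reference descriptions to the question."""
    lines = ["Reference descriptions of visually similar CG images:"]
    lines += [f"{i}. {ref}" for i, ref in enumerate(references, 1)]
    lines.append(f"Question about the given image: {question}")
    return "\n".join(lines)

# Toy database: embeddings and descriptions are made up for illustration.
database = [
    {"embedding": [1.0, 0.0], "description": "Sharp textures, realistic lighting."},
    {"embedding": [0.9, 0.1], "description": "Detailed textures, slight aliasing."},
    {"embedding": [0.0, 1.0], "description": "Flat shading, blurry low-resolution textures."},
]

refs = retrieve_descriptions([1.0, 0.05], database, k=2)
prompt = build_prompt("Is the texture quality high?", refs)
print(prompt)
```

In practice the embeddings would come from an image encoder and the assembled prompt would be passed to the VLM together with the query image; both steps are left out of this sketch.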

Paper Structure (20 sections, 13 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: An example illustrating the capability of the proposed R4-CGQA. R4-CGQA not only makes the responses of large models more concise and accurate, but also effectively unleashes their potential on the CGQA task.
  • Figure 2: We prepare 10 CG images with associated questions and answers (Q1: multiple-choice questions; Q2: yes-or-no questions) and evaluate answer accuracy on LLaVA-13B [DBLP:conf/nips/LiuLWL23a]. Dark bars indicate accuracy when the VLM answers directly; light bars indicate accuracy when reference descriptions from visually similar CG images are provided to the VLM.
  • Figure 3: CG Image Quality Perception Factors.
  • Figure 4: Overview of dataset characteristics. (a) Example image–description pairs illustrating diverse visual content. (b) Distribution of description lengths across the dataset. (c) Word cloud visualization of the most frequent descriptive terms, highlighting key perceptual attributes such as texture, light, and realism. (d) Top 30 most frequent words and their corresponding frequencies.
  • Figure 5: GPT-4o is used to generate three types of questions for the validation and testing sets.
  • ...and 3 more figures