Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment

Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang

TL;DR

This work introduces Grounding-IQA, a paradigm that combines multimodal referring and grounding with image quality assessment to achieve fine-grained, region-aware IQA. It defines two tasks, GIQA-DES for descriptive quality with precise locations and GIQA-VQA for location-aware QA, and builds GIQA-160K via an automated annotation pipeline, plus GIQA-Bench for comprehensive evaluation along description quality, VQA accuracy, and grounding precision. Fine-tuning multiple MLLMs on GIQA-160K demonstrates improved grounding and description capabilities, enabling more accurate region-specific quality judgments. The GIQA framework and benchmark advance practical, fine-grained IQA suitable for downstream editing and quality control in multimodal systems, with code provided for reproducibility.

Abstract

The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions, which allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, which can limit fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, **grounding-IQA**. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception, thereby extending existing IQA. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a benchmark, GIQA-Bench, which evaluates grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed method facilitates more fine-grained IQA applications. Code: https://github.com/zhengchen1999/Grounding-IQA.
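As an illustration of what measuring grounding precision over bounding boxes can look like, the sketch below computes intersection-over-union (IoU) between a predicted box and a reference box. This is a common grounding metric, not necessarily the one GIQA-Bench uses; the abstract does not specify the metric. The `(x_min, y_min, x_max, y_max)` box format and the function name `box_iou` are assumptions made for this example.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

Under this sketch, a predicted region would count as correctly grounded when its IoU with the annotated region exceeds a chosen threshold; the threshold value is likewise an assumption left to the evaluator.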

Paper Structure

This paper contains 14 sections, 2 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Performance comparisons on GIQA-Bench. Our proposed grounding-GPT effectively combines grounding and IQA.
  • Figure 2: Grounding-IQA combines referring and grounding with IQA. (a) GIQA-DES: Quality descriptions include precise locations (i.e., bounding boxes). (b) GIQA-VQA: The question (referring, bottom instance) or answer (grounding, top instance) contains locations.
  • Figure 3: Illustration of the automated annotation pipeline. (a) GIQA-DES Pipeline: Constructs the **answer** from the given image and description via a four-stage process, while the **question** comes from a predefined question pool. (b) GIQA-VQA Pipeline: Generates the corresponding QA data utilizing descriptions from GIQA-DES and the LLM (Llama3 [dubey2024llama]).
  • Figure 3: Ablation study on multi-task training. The baseline is the pre-trained model, mPLUG-Owl2-7B, without fine-tuning.
  • Figure 4: Using the description phrase $\mathcal{T}_{r}$ ("the man wearing a white t-shirt") yields more accurate detection than using only the object name ("man").
  • ...and 4 more figures