Q-Ground: Image Quality Grounding with Large Multi-modality Models

Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

TL;DR

This work introduces visual quality grounding, a paradigm that moves beyond global IQA scores to region-level distortion analysis guided by text prompts. It presents QGround-100K, a 100K-sample dataset with human and GPT-4V annotations, built upon Q-Instruct, that links image regions with textual quality descriptions and distortion masks. The authors propose a multi-scale feature abstractor (MSFA) within a large multi-modality model framework to jointly generate quality-centric text and pixel-level segmentation, trained with a multi-task objective that combines VQA, semantic segmentation, and visual quality reasoning. Results on the new QGround benchmark show that the approach can outperform traditional segmentation baselines in local distortion understanding while retaining interactive reasoning capabilities, a practical step toward fine-grained, explainable IQA and interactive image editing workflows.

Abstract

Recent advances in large multi-modality models (LMMs) have greatly improved the ability of image quality assessment (IQA) methods to evaluate and explain the quality of visual content. However, these advances focus mostly on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, remains largely unexplored. In this work, we introduce Q-Ground, the first framework aimed at fine-scale visual quality grounding, which combines large multi-modality models with detailed visual quality analysis. Central to our contribution is the QGround-100K dataset, a novel resource containing 100k triplets of (image, quality text, distortion segmentation) to facilitate deep investigation of visual quality. The dataset comprises two parts: one with human-labeled annotations for accurate quality assessment, and another labeled automatically by LMMs such as GPT-4V, which improves the robustness of model training while reducing the cost of data collection. With the QGround-100K dataset, we propose an LMM-based method equipped with multi-scale feature learning to train models capable of both image quality answering and distortion segmentation based on text prompts. This dual-capability approach not only refines the model's understanding of region-aware image quality but also enables it to respond interactively to complex, text-based queries about image quality and specific distortions. Q-Ground takes a step towards sophisticated visual quality analysis at a finer scale, establishing a new benchmark for future research in the area. Code and dataset are available at https://github.com/Q-Future/Q-Ground.
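
To make the (image, quality text, distortion segmentation) triplet concrete, the following is a minimal Python sketch of how one such sample might be represented, together with a per-label intersection-over-union (IoU) check between a predicted and an annotated mask. The field names, the label ids, and the category names are assumptions made for illustration. They are not taken from the released dataset or code, and IoU is used here only as a common segmentation measure, not as a statement of the paper's evaluation protocol.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical label set: these category names are placeholders for
# illustration, not the dataset's actual distortion taxonomy.
DISTORTION_LABELS: Dict[int, str] = {0: "none", 1: "blur", 2: "noise"}

Mask = List[List[int]]  # one integer label id per pixel, stored row by row


@dataclass
class QualityGroundingSample:
    """One (image, quality text, distortion segmentation) triplet."""
    image_path: str         # location of the image file (hypothetical field)
    quality_text: str       # free-form description of the image's quality
    distortion_mask: Mask   # per-pixel distortion label ids
    source: str = "human"   # "human" or "lmm", mirroring the two dataset parts


def label_iou(pred: Mask, target: Mask, label: int) -> float:
    """Intersection-over-union of one label between two equally sized masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == label, t == label
            inter += in_pred and in_target
            union += in_pred or in_target
    # Convention chosen here: a label absent from both masks scores 1.0.
    return inter / union if union else 1.0


sample = QualityGroundingSample(
    image_path="example.jpg",
    quality_text="The background is out of focus and the sky is noisy.",
    distortion_mask=[[1, 1, 0], [2, 0, 0]],
)
prediction = [[1, 0, 0], [2, 2, 0]]
print(label_iou(prediction, sample.distortion_mask, label=1))  # 0.5
```

Storing the mask as per-pixel label ids keeps each sample self-describing: the same structure can hold a human-labeled or an automatically labeled annotation, distinguished only by the `source` field.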

Paper Structure

This paper contains 39 sections, 6 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: An example comparison between different tasks: (a) visual quality scoring provides only a numerical score, without an underlying rationale; (b) LMM-based reasoning offers clear explanations but lacks pixel-level comprehension; (c) the proposed approach to visual quality understanding supports quality reasoning and also delivers the corresponding pixel-level distortion segmentation masks.
  • Figure 2: The data annotation pipeline incorporates both human expertise and GPT-4V capabilities. First, the input image is pre-segmented using SAM [li2023semanticsam]. In the human annotation phase, annotators identify and categorize the types of distortion, using human-written quality descriptions as reference. Annotators are free to adjust the borders generated by SAM. In the GPT-4V annotation phase, the quality reference is generated by the Q-Instruct model. Each region is then marked with a number, paired with the quality text, and forwarded to GPT-4V, which outputs the types of distortion present in each numbered region.
  • Figure 3: Analysis of annotation agreement between different human annotators.
  • Figure 4: Statistics of the human-annotated and GPT-4V-annotated parts, shown separately.
  • Figure 5: The pipeline of our method. (a) The framework follows previous methods [lai2023lisa, ren2023pixellm]: it accepts images and text as input and produces textual outputs and segmentation results. (b), (c) Comparison of the multi-modal projection block in previous works and our proposed multi-scale feature abstractor.
  • ...and 7 more figures