Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models

Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen

TL;DR

Q-Doc introduces a three-tier benchmark for systematically probing document image quality assessment (DIQA) in multi-modal large language models (MLLMs), built on the SmartDoc-QA dataset. It defines coarse-level quality scoring, evaluated with the Spearman rank correlation coefficient (SRCC) and the Pearson linear correlation coefficient (PLCC), middle-level distortion type recognition, and fine-level distortion severity estimation. Evaluation across six representative MLLMs reveals nascent DIQA capabilities and consistent limitations in distortion interpretation, while chain-of-thought (CoT) prompting yields meaningful gains across all tiers. The benchmark provides a publicly available protocol to advance OCR-aware, quality-centric document understanding in multi-modal models.

Abstract

The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
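The coarse-level analysis described above compares model-assigned quality scores with human annotations through correlation. As an illustration only, the following minimal sketch computes SRCC and PLCC in plain Python. The score lists are made-up values, the function names are hypothetical, and no nonlinear mapping is applied before PLCC; whether the Q-Doc code does so is not stated here, so this should not be read as the authors' implementation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five document images (illustrative, not from the paper).
model_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_scores = [1.0, 4.0, 9.0, 16.0, 25.0]

srcc = spearman(model_scores, human_scores)
plcc = pearson(model_scores, human_scores)
print(f"SRCC={srcc:.4f}  PLCC={plcc:.4f}")
```

In this toy example the model ranks the images exactly as the annotators do, so SRCC reaches its maximum, while PLCC is slightly lower because the relationship between the two score lists is monotonic but not linear. Reporting both metrics separates ranking agreement from linear agreement.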

Paper Structure

This paper contains 27 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overview of the proposed Q-Doc benchmark. Q-Doc evaluates MLLMs across three levels: coarse-level quality scoring, middle-level distortion classification, and fine-level severity estimation, using real-world distorted document images.
  • Figure 2: Illustration of Chain-of-Thought (CoT) prompting. The model is encouraged to reason about textual readability and document clarity before producing a quality judgment.
  • Figure 3: Radar chart of MLLM performance across six quality evaluation metrics: Coarse-level correlation (SRCC, PLCC), Middle-level balanced accuracy (Single/Multiple), and Fine-level balanced accuracy (Single/Multiple).
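Figure 3 reports middle- and fine-level results as balanced accuracy. As a rough illustration, the sketch below uses the standard single-label definition, the mean of per-class recall. The distortion labels and predictions are invented for the example, the function name is hypothetical, and how Q-Doc scores multi-choice questions is not specified here, so that case is not covered.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Invented labels: a model that answers "blur" for every image.
y_true = ["blur", "blur", "blur", "blur", "noise", "shadow"]
y_pred = ["blur", "blur", "blur", "blur", "blur", "blur"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain:.3f}  balanced accuracy={balanced:.3f}")
```

Here plain accuracy rewards the model for always guessing the majority class, whereas balanced accuracy weights each distortion class equally and exposes the failure on the two minority classes. That property makes it a natural choice when distortion types are unevenly represented.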