Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models

Jiaxi Huang; Dongxu Wu; Hanwei Zhu; Lingyu Zhu; Jun Xing; Xu Wang; Baoliang Chen

TL;DR

Q-Doc introduces a three-tier benchmark that systematically probes document image quality assessment (DIQA) in multi-modal large language models (MLLMs) on the SmartDoc-QA dataset, using the Spearman rank-order (SRCC) and Pearson linear (PLCC) correlation coefficients as metrics. It defines coarse-level quality scoring, middle-level distortion type recognition, and fine-level distortion severity estimation. Evaluation across six representative MLLMs reveals nascent DIQA capabilities and consistent limitations in distortion interpretation, while chain-of-thought (CoT) prompting improves performance across all three tiers. The benchmark provides a publicly available protocol to advance OCR-aware, quality-centric document understanding in multi-modal models.

Abstract

The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with quality annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Notably, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: https://github.com/cydxf/Q-Doc.
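The coarse-level analysis compares model-assigned quality scores with reference annotations by correlation. The sketch below illustrates how the two standard coefficients, SRCC (Spearman rank-order) and PLCC (Pearson linear), can be computed in plain Python. It is a minimal illustration under stated assumptions, not the benchmark's evaluation code: the function names and the sample scores are hypothetical, and no nonlinear score mapping is applied before PLCC, although the actual protocol may include one.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with values[order[start]].
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def plcc(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only; these are not results from the paper.
model_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
reference_scores = [1.0, 4.0, 9.0, 16.0, 25.0]

print("SRCC:", srcc(model_scores, reference_scores))
print("PLCC:", plcc(model_scores, reference_scores))
```

In this toy case the model scores order the images exactly as the references do, so SRCC reaches 1, while PLCC stays below 1 because the relationship is monotonic but not linear. That difference is why the two coefficients are usually reported together: SRCC captures ranking agreement and PLCC captures linear agreement.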
