Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

Davide Napolitano, Luca Cagliero, Fabrizio Battiloro

TL;DR

This work examines how Visual LLMs handle plausible unanswerable questions in multi-page Visually Rich Documents (VRDs). It introduces VRD-UQA, a modular evaluation framework that automatically corrupts questions along NLP entities, document elements, and layouts, verifies actual unanswerability with a VLLM-based judge, and benchmarks 12 models on two VRD datasets. Key findings show that pretraining strategy and architecture strongly influence unanswerability detection, while corruption type and document length significantly affect performance; OCR and explicit unanswerability prompts yield notable gains, especially at the page level. The framework provides an open-source path toward building more robust VRD-VQA systems and highlights directions for model customization and in-context learning to mitigate unanswerability detection limitations.

Abstract

The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions remains an open research question. We study the robustness of VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) the VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) the effect of different types of corruption (NLP entity, document element, layout); (3) the effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or explicitly signaling that a question may be unanswerable). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
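
The corrupt, verify, evaluate loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Question` class, the `corrupt_entity` and `detection_accuracy` helpers, the entity labels, and the stub judge and model are all hypothetical names introduced here for illustration. Only same-type entity swaps are covered; the document-element and layout corruptions are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    text: str
    entity: str        # entity mentioned in the question
    entity_type: str   # hypothetical label, e.g. "DATE" or "ORG"


def corrupt_entity(
    question: Question, doc_entities: Sequence[tuple[str, str]]
) -> list[Question]:
    """Swap the question's entity with other same-type entities found in the document."""
    corrupted = []
    for value, etype in doc_entities:
        if etype == question.entity_type and value != question.entity:
            new_text = question.text.replace(question.entity, value)
            corrupted.append(Question(new_text, value, etype))
    return corrupted


def detection_accuracy(
    questions: Sequence[Question],
    judge: Callable[[Question], bool],
    model: Callable[[Question], bool],
) -> float:
    """Share of judge-verified unanswerable questions that the model flags as unanswerable."""
    verified = [q for q in questions if judge(q)]
    if not verified:
        return 0.0
    flagged = sum(1 for q in verified if model(q))
    return flagged / len(verified)


# Toy usage with stand-ins for the judge and the evaluated model (assumptions).
original = Question("What was the revenue in 2019?", "2019", "DATE")
doc_entities = [("2019", "DATE"), ("2021", "DATE"), ("Acme Corp", "ORG")]
corrupted = corrupt_entity(original, doc_entities)

judge = lambda q: True                 # pretend every corruption is truly unanswerable
model = lambda q: "2021" in q.text     # pretend the model flags questions mentioning 2021
accuracy = detection_accuracy(corrupted, judge, model)
```

In a real setting both `judge` and `model` would wrap VLLM calls that take the document pages as input; they are kept as plain callables here so the control flow stays visible.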

Paper Structure

This paper contains 33 sections, 5 figures, 15 tables.

Figures (5)

  • Figure 1: VRD-UQA generates unanswerable questions starting from an answerable question and the reference document. Example from DUDE (LandeghemPTJBBC23).
  • Figure 2: The Visually Rich Document Unanswerable Question Answering framework.
  • Figure 3: Impact of corruption type on VQA performance across datasets and sizes of the models (representative subset).
  • Figure 4: Effect of augmented information and window size on $Acc_D$ performance. DUDE dataset.
  • Figure 5: Ablation study on in-context learning strategy and window size. MPDocVQA dataset (addressing RQ3).