Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA

Michał Turski, Mateusz Chiliński, Łukasz Borchmann

TL;DR

CheckboxQA targets the underexplored problem of interpreting checkboxes in visually rich documents, where a small visual cue can drive critical decisions in legal and financial workflows. The authors curate a Document VQA benchmark of ~600 QA pairs over English documents with varied layouts and two answer types (Yes/No and lists), designed to test how well current systems ground checkbox states in the surrounding text. Commercial LVLMs and open-source baselines are evaluated with the Average Normalized Levenshtein Similarity ($ANLS$) metric (including the $ANLS^*$ variant with a 0.5 threshold). The top model reaches 83.2% while human performance stands at 97.5%, so a substantial gap persists. The results show that current document understanding models struggle with fine-grained visual and layout cues, which underscores the need for layout-aware, form-focused approaches. The dataset is publicly available, with the aim of catalyzing progress in real-world document processing for sectors like legal tech and finance, where better checkbox handling could improve regulatory and contractual compliance.
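To make the metric concrete, here is a minimal sketch of a thresholded normalized Levenshtein similarity for single-string answers, averaged over questions. It follows the commonly used ANLS definition and is not the authors' evaluation code; in particular, it omits the list handling that $ANLS^*$ adds. The function names, the lowercasing and whitespace stripping, and the treatment of two empty strings as a perfect match are assumptions made for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nls(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """Similarity in [0, 1]; zeroed when the normalized distance reaches the threshold."""
    # Assumption: case-insensitive comparison on stripped strings.
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0  # assumption: two empty answers count as a match
    distance = levenshtein(pred, ref) / longest
    return 1.0 - distance if distance < threshold else 0.0


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best score against any accepted answer."""
    scores = [max(nls(p, r, threshold) for r in refs)
              for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

With the 0.5 threshold, a near miss such as "checkd" against "checked" keeps most of its credit, while a wrong Yes/No answer scores zero rather than receiving partial credit for shared characters.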

Abstract

Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: https://github.com/Snowflake-Labs/CheckboxQA

Paper Structure

This paper contains 32 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: CheckboxQA consists of varied questions requiring interpretation of checkable content in the context of visually rich documents. Required answers range from simple yes/no to lists of values.
  • Figure 2: Excerpts from CheckboxQA documents (not an exhaustive list).
  • Figure 3: Histogram of collected document lengths in terms of PDF pages and words. The plot on the right indicates a long tail of lengthy documents.
  • Figure 4: Most popular question prefixes.
  • Figure 5: Histogram of annotated question and answer lengths.
  • ...and 3 more figures