HighlightBench: Benchmarking Markup-Driven Table Reasoning in Scientific Documents

Lexin Wang, Shenghua Liu, Yiwei Wang, Yujun Cai, Yuyao Ge, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

Abstract

Visual markups such as highlights, underlines, and bold text are common in table-centric documents. Although multimodal large language models (MLLMs) have made substantial progress in document understanding, their ability to treat such cues as explicit logical directives remains under-explored. More importantly, existing evaluations cannot distinguish whether a model fails to see the markup or fails to reason with it. This creates a key blind spot in assessing markup-conditioned behavior over tables. To address this gap, we introduce HighlightBench, a diagnostic benchmark for markup-driven table understanding that decomposes evaluation into five task families: Markup Grounding, Constrained Retrieval, Local Relations, Aggregation & Comparison, and Consistency & Missingness. We further provide a reference pipeline that makes intermediate decisions explicit, enabling reproducible baselines and finer-grained attribution of errors along the perception-to-execution chain. Experiments show that even strong models remain unstable when visual cues must be consistently aligned with symbolic reasoning under structured output constraints.

Paper Structure

This paper contains 20 sections, 2 equations, 8 figures, 2 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Overview of HighlightBench. The benchmark contains five complementary task families with question counts. This design gives transparent coverage of markup-conditioned reasoning.
  • Figure 2: Pipeline overview. The reference pipeline converts the input image into a unified docgraph, uses two-stage routing to determine the solving path, and executes the routed result as a DSL plan on the docgraph to produce the final structured output and its intermediate trace (a minimal illustrative sketch follows this list).
  • Figure 3: Representative examples of the five task families in HighlightBench. Each card shows the core capability tested by a task family, together with a typical question format and its expected structured output.
  • Figure 4: Extremum Reasoning accuracy under the counterfactual variants.
  • Figure 5: Perception-task accuracy under counterfactual variants. Rows correspond to Bold Presence, Underline Presence, and Highlight Presence.
  • ...and 3 more figures
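
The Figure 2 caption only summarizes the pipeline at a high level. As a rough illustration of the shape it describes (an image parsed into a unified docgraph, a router that emits a small DSL plan, and an executor that runs the plan while recording a trace), here is a minimal Python sketch. Every name, field, and DSL operation below is an assumption made for illustration, not the paper's actual interface; the perception stage (image to docgraph) is omitted and a toy docgraph is built by hand.

```python
# Hypothetical sketch only: none of these names appear in the paper, and the
# DSL operations are stand-ins for whatever the reference pipeline defines.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class DocGraph:
    """Unified document graph: cell text plus per-cell visual markup."""
    cells: dict[tuple[int, int], str] = field(default_factory=dict)        # (row, col) -> text
    markup: dict[tuple[int, int], set[str]] = field(default_factory=dict)  # (row, col) -> {"bold", "highlight", ...}


def route(question: str, graph: DocGraph) -> list[dict[str, Any]]:
    """Stand-in for two-stage routing: map the question to a small DSL plan."""
    # A trivial fixed plan; a real router would pick operations per task family.
    return [
        {"op": "select", "where": "highlight"},  # keep only markup-bearing cells
        {"op": "aggregate", "fn": "max"},        # then aggregate over them
    ]


def execute(plan: list[dict[str, Any]], graph: DocGraph) -> dict[str, Any]:
    """Run the plan over the docgraph, recording an intermediate trace."""
    selected = list(graph.cells)
    trace = []
    for step in plan:
        if step["op"] == "select":
            selected = [k for k in selected if step["where"] in graph.markup.get(k, set())]
        elif step["op"] == "aggregate" and step["fn"] == "max" and selected:
            selected = [max(selected, key=lambda k: float(graph.cells[k]))]
        trace.append({"step": step, "cells": list(selected)})
    return {"answer": [graph.cells[k] for k in selected], "trace": trace}


# Toy usage: a 2x1 "table" where only one cell is highlighted.
graph = DocGraph(
    cells={(0, 0): "3.2", (1, 0): "7.5"},
    markup={(1, 0): {"highlight"}},
)
print(execute(route("Which highlighted value is largest?", graph), graph))
# -> {'answer': ['7.5'], 'trace': [...]}
```

Keeping the trace alongside the final answer is what would allow errors to be attributed along the perception-to-execution chain, in the spirit of the abstract's claim about finer-grained error attribution.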