TABLET: A Large-Scale Dataset for Robust Visual Table Understanding

Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata

TL;DR

TABLET addresses a key gap in Visual Table Understanding (VTU) by providing a large-scale dataset that preserves original table visualizations and pairs them with HTML representations and provenance information across 20 tasks and 14 seed datasets. Training vision-language models on this diverse, traceable resource improves robustness to real-world table visuals and transfers to unseen VTU benchmarks. The dataset emphasizes extensibility and evaluation alignment with downstream VTU use cases, enabling more realistic, multilingual, and multimodal table reasoning. Overall, TABLET supports scalable, grounded training and more faithful evaluation of table understanding in pixel-based systems.

Abstract

While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.

Paper Structure

This paper contains 39 sections, 3 equations, 17 figures, 12 tables.

Figures (17)

  • Figure 1: Previous datasets render table images from serialized tables, losing original visual details. In contrast, Tablet locates and retrieves the original table visualizations across 14 tabular datasets, resulting in 4M examples grounded in 2M unique tables.
  • Figure 2: Image source distribution in Tablet, broken down by task, i.e., the source of the image referenced by each example in each task. While the distribution resembles that of the unique image pool, it is computed at the example level (e.g., if the same Wikipedia image appears in two examples of a task, it is counted twice). "Wikipedia" denotes original visualizations from Wikipedia. "Seed render" denotes synthetic images rendered from information in the seed dataset.
  • Figure 3: A Tablet example for the ToTTo table-to-text task.
  • Figure 4: Instruction length distribution across tasks in Tablet.
  • Figure 5: Example for the Entity Linking task, based on a highlighted table cell.
  • ...and 12 more figures