The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

I. de Rodrigo, A. Sanchez-Cuadrado, J. Boal, A. J. Lopez-Lopez

TL;DR

The paper addresses the challenge of Visually-rich Document Understanding (VrDU) by introducing the MERIT Dataset, a multimodal corpus of school grade reports that combines text, images, and layout annotations across 33k samples and 400+ labels. It presents the dataset generation pipeline and analyzes the dataset's features along textual, visual, layout, and bias dimensions. A token-classification benchmark shows that the dataset remains challenging for state-of-the-art models, and that those models would benefit from including MERIT samples in pretraining. Because biases can be introduced into the reports in a controlled way, MERIT also serves as a tool for benchmarking biases induced in language models.

Abstract

This paper introduces the MERIT Dataset, a multimodal (text + image + layout) fully labeled dataset within the context of school reports. Comprising over 400 labels and 33k samples, the MERIT Dataset is a valuable resource for training models in demanding Visually-rich Document Understanding (VrDU) tasks. By its nature (student grade reports), the MERIT Dataset can potentially include biases in a controlled way, making it a valuable tool to benchmark biases induced in Large Language Models (LLMs). The paper outlines the dataset's generation pipeline and highlights its main features in the textual, visual, layout, and bias domains. To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for state-of-the-art (SOTA) models and that these would greatly benefit from including samples from the MERIT Dataset in their pretraining phase.
