Label Errors in the Tobacco3482 Dataset

Gordon Lim, Stefan Larson, Kevin Leach

TL;DR

This paper audits the label quality of the Tobacco3482 document classification dataset and finds substantial problems: 11.7% of samples are improperly annotated (their label should be either unknown or corrected) and 16.7% have multiple valid labels. After establishing annotation guidelines and re-annotating the data, the authors show that 35% of a top transformer model's errors on the original dataset are attributable to label problems, and that correcting for these issues raises the observed accuracy from 84.1% to 89.7%. The findings, which align with broader observations on RVL-CDIP, caution against overreliance on noisy benchmarks and emphasize the need for guideline-driven labeling and dataset revision. The work highlights the practical impact on benchmarking, underscores potential biases, and advocates for more robust evaluation practices in document-understanding research.

Abstract

Tobacco3482 is a widely used document classification benchmark dataset. However, our manual inspection of the entire dataset uncovers widespread ontological issues, in particular a large number of annotation label problems. We establish data label guidelines and find that 11.7% of the dataset is improperly annotated and should either have an unknown label or a corrected label, and 16.7% of samples in the dataset have multiple valid labels. We then analyze the mistakes of a top-performing model and find that 35% of the model's mistakes can be directly attributed to these label issues, highlighting the inherent problems with using a noisily labeled dataset as a benchmark. Supplementary material, including dataset annotations and code, is available at https://github.com/gordon-lim/tobacco3482-mistakes/.
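
The accuracy figures quoted in the TL;DR (84.1% observed, 89.7% after correction) are consistent with the 35% figure if every model mistake caused by a label issue is simply counted as correct. The sketch below reproduces that arithmetic. The function name `corrected_accuracy` and the use of rounded percentages as inputs are assumptions made for illustration; the authors presumably work from exact per-sample counts, which are not given here.

```python
def corrected_accuracy(observed_accuracy: float, label_error_share: float) -> float:
    """Accuracy after forgiving the mistakes that stem from label issues.

    observed_accuracy: accuracy measured against the original labels (0..1).
    label_error_share: fraction of the model's mistakes attributed to
        label problems (0..1).
    """
    error_rate = 1.0 - observed_accuracy          # share of samples counted as wrong
    forgiven = error_rate * label_error_share     # wrong only because of the label
    return observed_accuracy + forgiven


# Rounded figures taken from the summary above (assumed, not exact counts).
observed = 0.841
share = 0.35

corrected = corrected_accuracy(observed, share)
print(f"{corrected:.1%}")
```

Under these assumptions, the 15.9% error rate shrinks by roughly 5.6 percentage points, which accounts for the gap between the two reported accuracies.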

Paper Structure

This paper contains 9 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Top row: unproblematic document images from Tobacco3482. Bottom row: samples from Tobacco3482 that are erroneously labeled (left and center) or could have multiple valid labels (right). Bottom left: the document is not a news article. Bottom center: the document should be labeled as Memo and not Letter. Bottom right: the image contains both a letter (background) and a note (foreground), and thus could have two valid labels (Note and Letter).
  • Figure 2: Examples of problematic samples from Tobacco3482. Top row: documents where the valid label is unknown. Middle row: documents that have the wrong original label; the original label is shown on top in red and the corrected label on the bottom in blue. Bottom row: documents that have multiple valid labels; the original label is shown on top and the additional valid label on the bottom.
  • Figure 3: UpSet plot [upset-plot-2014] of multi-label Tobacco3482 category annotations. Most of the documents with multiple labels are both Reports and Memos.