
CatalogBank: A Structured and Interoperable Catalog Dataset with a Semi-Automatic Annotation Tool (DocumentLabeler) for Engineering System Design

Hasan Sinan Bank, Daniel R. Herber

TL;DR

The paper addresses the lack of large, structured multimodal datasets for engineering design catalogs and the challenge of integrating textual specifications with geometry. It introduces CatalogBank, a large, digitally born catalog dataset, and DocumentLabeler, an open-source semi-automatic annotation tool, to bridge text with product information. A baseline information-extraction experiment using the PICK model demonstrates high performance on CatalogBank ($mEP \approx 0.99$, $mER \approx 0.99$, $mEF \approx 0.99$, $mEA \approx 0.99$) and strong results on DocBank (~$0.91$). The work provides practical, reusable resources and workflows to advance document engineering, NLP, and downstream design-automation tasks.
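The entity-level scores quoted above can be illustrated with a small sketch. It assumes that mEP, mER, and mEF stand for mean entity precision, recall, and F1, and that "mean" is a macro-average over entity types; neither point is stated in this summary, so the paper's own equations take precedence. The tuple layout, the function names, and the sample entities are hypothetical, and mEA is not covered.

```python
def entity_scores(gold, pred):
    """Per-entity-type precision, recall, and F1 under exact match.

    gold, pred: sets of (doc_id, entity_type, text) tuples.
    This tuple layout is an assumption made for illustration only.
    """
    types = sorted({t for _, t, _ in gold | pred})
    per_type = {}
    for t in types:
        g = {e for e in gold if e[1] == t}
        p = {e for e in pred if e[1] == t}
        tp = len(g & p)  # predictions that exactly match a gold entity
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_type[t] = (precision, recall, f1)
    return per_type


def macro_average(per_type):
    """Average each score over entity types (assumed meaning of 'mean')."""
    n = len(per_type)
    if n == 0:
        return 0.0, 0.0, 0.0
    m_ep = sum(s[0] for s in per_type.values()) / n
    m_er = sum(s[1] for s in per_type.values()) / n
    m_ef = sum(s[2] for s in per_type.values()) / n
    return m_ep, m_er, m_ef


# Hypothetical catalog entities: one price is extracted incorrectly.
gold = {
    ("d1", "name", "Hex Bolt"),
    ("d1", "price", "$1.20"),
    ("d2", "name", "Washer"),
}
pred = {
    ("d1", "name", "Hex Bolt"),
    ("d1", "price", "$1.25"),
    ("d2", "name", "Washer"),
}

m_ep, m_er, m_ef = macro_average(entity_scores(gold, pred))
print(m_ep, m_er, m_ef)
```

Macro-averaging weights every entity type equally, so a rare field that is extracted poorly lowers the mean as much as a common one would.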

Abstract

In the realm of document engineering and Natural Language Processing (NLP), the integration of digitally born catalogs into product design processes presents a novel avenue for enhancing information extraction and interoperability. This paper introduces CatalogBank, a dataset developed to bridge the gap between textual descriptions and other data modalities related to engineering design catalogs. We utilized existing information extraction methodologies to extract product information from PDF-based catalogs to use in downstream tasks to generate a baseline metric. Our approach not only supports the potential automation of design workflows but also overcomes the limitations of manual data entry and non-standard metadata structures that have historically impeded the seamless integration of textual and other data modalities. Through the use of DocumentLabeler, an open-source annotation tool adapted for our dataset, we demonstrated the potential of CatalogBank in supporting diverse document-based tasks such as layout analysis and knowledge extraction. Our findings suggest that CatalogBank can contribute to document engineering and NLP by providing a robust dataset for training models capable of understanding and processing complex document formats with relatively less effort using the semi-automated annotation tool DocumentLabeler.

Paper Structure (7 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Evolution of engineering design from mechanical to generative tools (weisberg2008engineering; nash2020polygen; siddiqui2024meshgpt).
  • Figure 2: A sample from McMaster-Carr v125 giving an overall view of the complete CatalogBank dataset.
  • Figure 3: The details of document annotation and information extraction workflow Document Dataset (ML Model: PICK (yu2021pick) or others).
  • Figure 4: a) Digitally born catalogs in PDF and b) after preprocessing (Peruse of Step 0 from Fig. 3) from well-known vendors such as Misumi, Newark, Thorlabs, McMaster-Carr, 8020, and Grainger, respectively.
  • Figure 5: a) Importing Data, b) Manual Operations (Labeling, Merging, or Deleting), and c) Inference on Selected Model
  • ...and 1 more figure