ManufactuBERT: Efficient Continual Pretraining for Manufacturing

Robin Armingaud, Romaric Besançon

TL;DR

ManufactuBERT tackles the domain gap in manufacturing NLP by continually pretraining a RoBERTa-based encoder on a carefully curated manufacturing corpus. The authors design a two-stage data pipeline that first filters a large web-derived corpus and then applies SemDeDup-based deduplication to reduce redundancy, enabling faster convergence with a smaller, higher-quality dataset. Empirical results show that ManufactuBERTD achieves state-of-the-art performance on several manufacturing NLP tasks while maintaining solid general-language abilities on GLUE, and the authors quantify substantial training-time and energy savings from deduplication. The work provides a reproducible framework for domain adaptation in specialized fields and demonstrates that data quality and diversity can outweigh sheer dataset size in domain-specific pretraining.

Abstract

While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state-of-the-art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates convergence, leading to a 33% reduction in training time and computational cost compared to training on the non-deduplicated dataset. The proposed pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. We will release our model and curated corpus at https://huggingface.co/cea-list-ia.
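The semantic deduplication idea mentioned in the TL;DR can be illustrated with a small sketch. This is not the authors' pipeline: it is a simplified, greedy variant written under stated assumptions. SemDeDup as published first clusters document embeddings (with k-means) and compares documents only within a cluster; the sketch below skips the clustering step and compares every document against the ones already kept. The function names `cosine` and `semantic_dedup`, the threshold `epsilon`, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_dedup(
    embeddings: Sequence[Sequence[float]], epsilon: float = 0.05
) -> List[int]:
    """Return indices of documents kept after greedy near-duplicate removal.

    A document is dropped when its cosine similarity to any already kept
    document exceeds 1 - epsilon. (Assumption: no clustering step, so the
    cost is quadratic in the number of documents.)
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= 1.0 - epsilon for j in kept):
            kept.append(i)
    return kept


# Toy, hand-made 2-D "embeddings" (hypothetical values, not real data).
toy = [
    [1.0, 0.0],    # doc 0
    [0.99, 0.05],  # doc 1: almost the same direction as doc 0
    [0.0, 1.0],    # doc 2: unrelated to doc 0
]
kept = semantic_dedup(toy, epsilon=0.05)
print(kept)  # [0, 2]
```

In a realistic setting the embeddings would come from a sentence encoder, and the clustering step would be needed to keep the pairwise comparisons tractable on a web-scale corpus. A larger `epsilon` removes more documents, trading corpus size for diversity.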

Paper Structure

This paper contains 24 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Workflow of the data filtering and deduplication steps used to create the ManufactuBERT pretraining corpus.
  • Figure 2: Performance evolution of ManufactuBERT and ManufactuBERTD on the FabNER dataset across 17,500 training steps.
  • Figure 3: Performance evolution of ManufactuBERTC and ManufactuBERTD on the FabNER dataset across 12,500 training steps.
  • Figure 4: Performance evolution of ManufactuBERTD4 and ManufactuBERTD on the FabNER dataset across 12,500 training steps.