ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking

Jian Chen, Jiabao Dou

TL;DR

The paper tackles automated rule checking (ARC) in the architecture, engineering, and construction (AEC) domain by addressing the gap between general pre-trained models and regulatory texts for named entity recognition (NER). It introduces ARCE, a three-stage framework: an LLM generates a corpus of contextualized task-oriented elucidations (Cote), RoBERTa is incrementally pre-trained on that corpus, and the resulting model is fine-tuned for NER with a CRF layer. ARCE achieves a state-of-the-art Macro-F1 of 77.20% on a public AEC NER benchmark, outperforming domain-adapted baselines and zero-shot/few-shot LLMs, and it is data-efficient (e.g., strong performance with only 25% of Cote). A key finding is the less-is-more principle: simple, explanation-based prompts yield better domain transfer for small models because they provide clean semantic signals and avoid the noise introduced by complex reasoning. The approach offers a practical, scalable solution for resource-constrained ARC deployments and points to broader applicability across regulatory domains.

Abstract

Accurate information extraction from specialized texts is a critical challenge for automated rule checking (ARC) in the architecture, engineering, and construction (AEC) domain. While large language models (LLMs) possess strong reasoning capabilities, their deployment in resource-constrained AEC environments is often impractical. Conversely, standard efficient models struggle with the significant domain gap. Although this gap can be mitigated by pre-training on large, human-curated corpora, such approaches are labor-intensive and costly. To address this, we propose ARCE (Augmented RoBERTa with Contextualized Elucidations), a novel knowledge distillation framework that leverages LLMs to synthesize a task-oriented corpus, termed Cote, for incrementally pre-training smaller models. ARCE systematically explores the optimal strategy for knowledge transfer. Our extensive experiments demonstrate that ARCE establishes a new state of the art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20% and outperforming both domain-specific baselines and fine-tuned LLMs. Crucially, our study reveals a "less-is-more" principle: in the NER task, simple, direct explanations prove significantly more effective for domain adaptation than complex, role-based rationales, which tend to introduce semantic noise. The source code will be made publicly available upon acceptance.
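
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Macro-F1 score can be computed for NER: F1 is calculated separately for each entity type and then averaged with equal weight, so rare types count as much as frequent ones. It assumes entity-level exact-match scoring over (sentence, start, end, type) spans; the benchmark's actual evaluation protocol is not described in this summary, and the entity types "obj" and "prop" below are hypothetical.

```python
def macro_f1(gold_spans, pred_spans):
    """Macro-F1 over entity types.

    Each span is a tuple (sentence_id, start, end, entity_type).
    Assumption: a prediction counts only on an exact span and type match.
    """
    labels = sorted({span[-1] for span in gold_spans | pred_spans})
    if not labels:
        return 0.0
    per_type_f1 = []
    for label in labels:
        gold = {s for s in gold_spans if s[-1] == label}
        pred = {s for s in pred_spans if s[-1] == label}
        true_pos = len(gold & pred)
        # F1 = 2TP / (2TP + FP + FN) = 2TP / (|gold| + |pred|)
        denom = len(gold) + len(pred)
        per_type_f1.append(2 * true_pos / denom if denom else 0.0)
    # Unweighted mean: every entity type contributes equally.
    return sum(per_type_f1) / len(per_type_f1)


# Hypothetical entity types, for illustration only.
gold = {(0, 0, 2, "obj"), (0, 3, 4, "prop"), (1, 0, 1, "obj")}
pred = {(0, 0, 2, "obj"), (0, 3, 4, "obj"), (1, 0, 1, "obj")}
score = macro_f1(gold, pred)
```

In this toy case the "obj" type scores 0.8 and the "prop" type scores 0, so the macro average is 0.4 even though two of the three predicted spans are correct. This sensitivity to poorly handled types is why the metric suits benchmarks with imbalanced entity classes.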

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The overall architecture of our proposed ARCE approach. The approach consists of three stages. Stage 1: We employ an LLM (e.g., Qwen3) to generate a Cote (contextualized task-oriented elucidation) corpus from raw domain texts using specialized prompts. Stage 2: A pre-trained RoBERTa-wwm-ext model undergoes incremental pre-training on the Cote corpus via a masked language modeling (MLM) objective, yielding the enhanced ARCE model. Stage 3: The ARCE model is augmented with a CRF layer and fine-tuned on the downstream NER task, and is then used for inference on new, unseen texts.
  • Figure 2: Comparison of knowledge generation strategies. Left (Strategy A): The prompt used in our ARCE framework, focusing on simple explanations. Right (Strategy B): A complex prompt design forcing deep role analysis.
  • Figure 3: Results of the ablation experiment.
  • Figure 4: Data Efficiency and Scalability Analysis.
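
The MLM objective named in Stage 2 of Figure 1 can be illustrated with a small sketch of how training examples are built from a piece of text. This is not the authors' implementation: it assumes the standard BERT-style recipe (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged), works at the token level rather than with the whole-word masking that the "wwm" in RoBERTa-wwm-ext refers to, and uses a made-up sentence and vocabulary.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Build one MLM training example from a token list.

    Returns (inputs, labels). labels[i] holds the original token at
    positions the model must predict, and None everywhere else.
    Assumption: standard 80/10/10 corruption split, not confirmed
    for ARCE by the source.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(token)              # 10%: left unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            inputs.append(token)
    return inputs, labels


# Hypothetical elucidation text and vocabulary, for illustration only.
tokens = "a fire door is a door that resists the spread of fire".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, mask_prob=0.15, rng=random.Random(0))
```

During incremental pre-training the loss is computed only at positions where `labels` is not None, so the model learns domain vocabulary and usage from the generated corpus without needing any NER annotations.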