ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
Jian Chen, Jiabao Dou
TL;DR
The paper tackles automated rule checking (ARC) in the AEC domain by addressing the domain gap between general pre-trained models and regulatory texts for NER. It introduces ARCE, a three-stage framework that uses an LLM to generate a Corpus of contextualized task-oriented elucidations (Cote) to incrementally pre-train RoBERTa, followed by CRF-based NER fine-tuning. ARCE achieves state-of-the-art Macro-F1 on a public AEC NER benchmark (77.20%), outperforming domain-adapted baselines and zero-shot/few-shot LLMs, while demonstrating data efficiency (e.g., strong performance with only 25% of Cote). A key finding is the less-is-more principle: simple, explanation-based prompts yield superior domain transfer for small models by providing clean semantic signals and reducing noise from complex reasoning. The approach offers a practical, scalable solution for resource-constrained ARC deployments and points to broader applicability across regulatory domains.
Abstract
Accurate information extraction from specialized texts is a critical challenge for automated rule checking (ARC) in the architecture, engineering, and construction (AEC) domain. While large language models (LLMs) possess strong reasoning capabilities, their deployment in resource-constrained AEC environments is often impractical. Conversely, standard efficient models struggle with the significant domain gap. Although this gap can be mitigated by pre-training on large, human-curated corpora, such approaches are labor-intensive and costly. To address this, we propose ARCE (Augmented RoBERTa with Contextualized Elucidations), a novel knowledge distillation framework that leverages LLMs to synthesize a task-oriented corpus, termed Cote, for incrementally pre-training smaller models. ARCE systematically explores the optimal strategy for knowledge transfer. Our extensive experiments demonstrate that ARCE establishes a new state of the art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20% and outperforming both domain-specific baselines and fine-tuned LLMs. Crucially, our study reveals a "less is more" principle: in the NER task, simple, direct explanations prove significantly more effective for domain adaptation than complex, role-based rationales, which tend to introduce semantic noise. The source code will be made publicly available upon acceptance.
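Since the headline result is reported as Macro-F1, a minimal sketch of how such a score can be computed for NER may help readers interpret it. The sketch assumes entity-level exact matching on (sentence, start, end, label) spans and an unweighted mean over all labels that occur in the gold or predicted sets; the exact evaluation protocol is not specified in this section, and the function name `macro_f1` and the label names `obj` and `property` are hypothetical.

```python
def macro_f1(gold, pred):
    """Return (macro_f1, per_label_f1) for two sets of entity spans.

    Each span is a tuple (sentence_id, start, end, label). A prediction
    counts as correct only if all four fields match a gold span exactly
    (an assumption; other matching schemes exist).
    """
    labels = {span[-1] for span in gold | pred}
    per_label = {}
    for label in labels:
        gold_l = {s for s in gold if s[-1] == label}
        pred_l = {s for s in pred if s[-1] == label}
        true_pos = len(gold_l & pred_l)
        precision = true_pos / len(pred_l) if pred_l else 0.0
        recall = true_pos / len(gold_l) if gold_l else 0.0
        denom = precision + recall
        per_label[label] = 2 * precision * recall / denom if denom else 0.0
    # Macro averaging: every label contributes equally, regardless of frequency.
    macro = sum(per_label.values()) / len(per_label) if per_label else 0.0
    return macro, per_label


# Toy example with hypothetical labels: one "property" span is mislabeled as "obj".
gold = {(0, 0, 2, "obj"), (0, 3, 4, "property"), (1, 0, 1, "obj")}
pred = {(0, 0, 2, "obj"), (0, 3, 4, "obj"), (1, 0, 1, "obj")}
score, by_label = macro_f1(gold, pred)
print(round(score, 4), {k: round(v, 4) for k, v in sorted(by_label.items())})
```

Because each label is weighted equally, a rare entity type that the model misses entirely pulls the score down as much as a frequent one would, which is why Macro-F1 is a demanding metric on datasets with imbalanced entity types.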
