Named Entity Recognition in Context
Colin Brisson, Ayoub Kahfy, Marc Bui, Frédéric Constant
TL;DR
This paper tackles NER for Classical Chinese texts within the EvaHan2025 framework, addressing the challenges of low-resource, historical language data. It proposes Pindola, a transformer-based bidirectional encoder, augmented with a retrieval module for external context and a generative context-summarization step to improve entity disambiguation. The authors augment training data with a merged pretraining set totaling 12,007 annotated sequences and report an average F1 of 85.58, exceeding the competition baseline by nearly 5 points. Ablation and analysis indicate only modest gains from external context and pretraining in this setup, which raises the question of when context is actually necessary and motivates more efficient approaches to historical NER.
Abstract
We present the Named Entity Recognition system developed by the Edit Dunhuang team for the EvaHan2025 competition. Our approach integrates three core components: (1) Pindola, a modern transformer-based bidirectional encoder pretrained on a large corpus of Classical Chinese texts; (2) a retrieval module that fetches relevant external context for each target sequence; and (3) a generative reasoning step that summarizes retrieved context in Classical Chinese for more robust entity disambiguation. Using this approach, we achieve an average F1 score of 85.58, improving upon the competition baseline by nearly 5 points.
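The data flow through the three components can be illustrated with a minimal sketch. It is not the authors' implementation: the character-bigram overlap retriever, the truncating `summarize` placeholder, the `[SEP]` separator, and all function names are assumptions made for illustration. In the described system, retrieval and summarization are handled by dedicated modules, and the combined input is passed to the Pindola encoder for tagging.

```python
def char_bigrams(text: str) -> set[str]:
    """Character bigrams; a crude stand-in for a real retrieval representation."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def retrieve(target: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus passages with the highest bigram Jaccard overlap.

    Hypothetical retriever: the abstract does not specify the retrieval method.
    """
    target_grams = char_bigrams(target)

    def score(passage: str) -> float:
        grams = char_bigrams(passage)
        union = target_grams | grams
        return len(target_grams & grams) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:k]


def summarize(passages: list[str], max_chars: int = 40) -> str:
    """Placeholder for the generative summarization step.

    A real system would call a generative model here; this version only
    concatenates the passages and truncates the result.
    """
    return "".join(passages)[:max_chars]


def build_encoder_input(target: str, context: str, sep: str = "[SEP]") -> str:
    """Join the target sequence and summarized context (separator is assumed)."""
    return f"{target}{sep}{context}"


if __name__ == "__main__":
    corpus = ["玄奘法師者，洛州人也", "春眠不覺曉", "法師西域記十二卷"]
    target = "玄奘法師往西域取經"
    context = summarize(retrieve(target, corpus, k=2))
    # The resulting string is what a token-classification encoder would tag.
    print(build_encoder_input(target, context))
```

Only the target portion of the combined input would receive entity labels; the appended context serves to help the encoder disambiguate mentions.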
