Composited-Nested-Learning with Data Augmentation for Nested Named Entity Recognition
Xingming Liao, Nankai Lin, Haowen Li, Lianglun Cheng, Zhuowei Wang, Chong Chen
TL;DR
This work tackles data scarcity in Nested Named Entity Recognition (NNER) by proposing Composited-Nested-Learning (CNL), built on a Composited-Nested-Label Classification (CNLC) template that jointly models nested tokens and labels. A dynamic data-augmentation pipeline using CNLC and a RoBERTa-based attention mechanism generates diverse, label-aware augmented data, while a Confidence Filtering Mechanism (CFM) selects high-confidence samples by pseudo-log-likelihood (PLL) filtering. Evaluated on ACE2004 and ACE2005, the approach improves span-level precision, recall, and F1 and alleviates sample imbalance, with parameter searches identifying optimal silver data proportions (e.g., 70% for ACE2004 and 35% for ACE2005). The results indicate that integrating label correlations into augmentation and applying quality filtering can enhance NNER performance and produce augmented data that other models can reuse. The work also provides an open-source augmented dataset to support future research in nested entity recognition and few-shot scenarios, with potential impact on more accurate information extraction from complex text.
Abstract
Nested Named Entity Recognition (NNER) addresses the recognition of overlapping entities. Compared to Flat Named Entity Recognition (FNER), annotated resources for NNER are scarce. Data augmentation is an effective way to compensate for insufficient annotated data, yet augmentation methods for NNER remain largely unexplored. Because NNER contains nested entities, existing data augmentation methods cannot be applied to it directly. Therefore, in this work, we focus on data augmentation for NNER and resort to a more expressive structure, Composited-Nested-Label Classification (CNLC), in which constituents combine nested words and nested labels, to model nested entities. The dataset is augmented using Composited-Nested-Learning (CNL). In addition, we propose the Confidence Filtering Mechanism (CFM) for more efficient selection of generated data. Experimental results demonstrate that this approach improves performance on ACE2004 and ACE2005 and alleviates the impact of sample imbalance.
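
As a rough illustration of how confidence filtering over generated sentences might look, the sketch below scores each candidate by a pseudo-log-likelihood (the sum of log-probabilities of each token when it alone is masked) and keeps candidates above a threshold. It is not the authors' implementation: the names `pseudo_log_likelihood`, `confidence_filter`, and `toy_log_prob`, the length normalization, and the threshold value are all assumptions made for this example, and the toy unigram scorer stands in for a real masked language model such as RoBERTa.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"  # placeholder mask symbol (assumption, not tied to any tokenizer)

# Signature of a scorer: (masked tokens, masked position, original token) -> log-probability
Scorer = Callable[[Sequence[str], int, str], float]


def pseudo_log_likelihood(tokens: Sequence[str], token_log_prob: Scorer) -> float:
    """Sum the log-probability of each token when only that token is masked."""
    total = 0.0
    for i, tok in enumerate(tokens):
        masked = list(tokens[:i]) + [MASK] + list(tokens[i + 1:])
        total += token_log_prob(masked, i, tok)
    return total


def confidence_filter(
    candidates: List[List[str]],
    token_log_prob: Scorer,
    threshold: float,
) -> List[Tuple[List[str], float]]:
    """Keep candidates whose length-normalized score reaches the threshold.

    Dividing by length is an assumption of this sketch, so that long
    sentences are not penalized merely for having more tokens.
    """
    kept = []
    for tokens in candidates:
        score = pseudo_log_likelihood(tokens, token_log_prob) / max(len(tokens), 1)
        if score >= threshold:
            kept.append((tokens, score))
    return kept


# Toy stand-in for a masked language model: a context-free unigram table.
TOY_PROBS = {"the": 0.5, "president": 0.2, "of": 0.2, "france": 0.1}


def toy_log_prob(masked_tokens: Sequence[str], position: int, original: str) -> float:
    # Ignores the context entirely; a real scorer would condition on masked_tokens.
    return math.log(TOY_PROBS.get(original, 1e-6))


candidates = [["the", "president", "of", "france"], ["zzz", "qqq"]]
kept = confidence_filter(candidates, toy_log_prob, threshold=-3.0)
```

With a real masked language model, `toy_log_prob` would be replaced by a function that runs the model on the masked sequence and reads off the log-probability of the original token at the masked position.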
