Composited-Nested-Learning with Data Augmentation for Nested Named Entity Recognition

Xingming Liao, Nankai Lin, Haowen Li, Lianglun Cheng, Zhuowei Wang, Chong Chen

TL;DR

This work tackles data scarcity in Nested Named Entity Recognition (NNER) by proposing Composited-Nested-Learning (CNL), built on a Composited-Nested-Label Classification (CNLC) template that jointly models nested tokens and labels. A dynamic data-augmentation pipeline using CNLC and a RoBERTa-based attention mechanism generates diverse, label-aware augmented data, while a Confidence Filtering Mechanism (CFM) keeps only high-confidence samples by filtering on pseudo-log-likelihood (PLL). Evaluated on ACE2004 and ACE2005, the approach improves span-level precision, recall, and F1 and alleviates sample imbalance, with parameter searches identifying optimal silver data proportions (e.g., 70% for ACE2004 and 35% for ACE2005). The results indicate that integrating label correlations into augmentation and applying quality filtering can enhance NNER performance and produce augmented data that other models can reuse. The work also provides an open-source augmented dataset to support future research in nested entity recognition and few-shot scenarios, with potential impact on more accurate information extraction from complex text.

Abstract

Nested Named Entity Recognition (NNER) addresses the recognition of overlapping entities. Compared to Flat Named Entity Recognition (FNER), annotated resources for NNER are scarce. Data augmentation is an effective way to compensate for an insufficient annotated corpus. However, data augmentation methods for NNER remain largely unexplored. Because NNER contains nested entities, existing data augmentation methods cannot be applied to it directly. Therefore, in this work, we focus on data augmentation for NNER and resort to a more expressive structure, Composited-Nested-Label Classification (CNLC), in which constituents combine nested words and nested labels, to model nested entities. The dataset is augmented using Composited-Nested-Learning (CNL). In addition, we propose the Confidence Filtering Mechanism (CFM) for a more efficient selection of generated data. Experimental results demonstrate that this approach improves performance on ACE2004 and ACE2005 and alleviates the impact of sample imbalance.

Paper Structure (15 sections, 2 equations, 4 figures, 4 tables)

This paper contains 15 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Three different patterns for recognizing label sequence templates, with each distinguished by different colors for their respective label formats. The left part represents the Flat-NER classification template, the middle part represents the current NNER classification template, and the right part represents the CNLC classification template we propose.
  • Figure 2: Model architecture of CNL: CNL is divided into five steps, serving as the input to the model during fine-tuning and generation. Step 1: Similar sentences are obtained from the corpus using a similarity filtering mechanism. Then, the important keywords related to NEs are extracted using attention maps obtained from the fine-tuned RoBERTa model. Step 2: After label tokens are added before and after each entity in the sentences using CNLC, the sentences are divided into two parts. One part, the original sentence template, is further masked, with a small portion of keywords dynamically masked. The other part is a template created by merging the sentence with similar sentences through a FUSION mechanism. Step 3: The CNL model is used to generate augmented data. Step 4: The generated samples are filtered through the CFM to obtain high-confidence sentences, which are then concatenated with the golden data. Step 5: The final data is used as the input for model $M$.
  • Figure 3: After acquiring data-augmented samples, the samples are subsequently filtered using the CFM. Within the sample filtering process, sentences with low PLLs are excluded, and high-confidence sentences are retained as our final silver dataset.
  • Figure 4: A parameter search was conducted for the silver dataset generated for ACE2004. The Rate represents the proportion of silver selected, and we tested it using model $M$, obtaining F-micro and F-macro scores at different proportions.
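
The filtering described in Figures 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `pseudo_log_likelihood`, `confidence_filter`, `keep_rate`, and `toy_scorer` are hypothetical, the scorer is a stand-in for a masked language model, and dividing the PLL by sentence length is an assumption made here so that short sentences are not favored. The sketch assumes that the "Rate" in Figure 4 corresponds to the fraction of generated sentences kept after ranking by PLL.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: returns log P(tokens[i] | all other tokens),
# as a masked LM would after masking position i.
MaskedScorer = Callable[[Sequence[str], int], float]


def pseudo_log_likelihood(tokens: Sequence[str], scorer: MaskedScorer) -> float:
    """Sum the masked-position log-probabilities over the whole sentence."""
    return sum(scorer(tokens, i) for i in range(len(tokens)))


def confidence_filter(
    candidates: List[List[str]], scorer: MaskedScorer, keep_rate: float
) -> List[List[str]]:
    """Keep the top `keep_rate` fraction of candidates, ranked by PLL.

    Assumption: PLL is divided by sentence length before ranking.
    """
    if not 0.0 <= keep_rate <= 1.0:
        raise ValueError("keep_rate must lie in [0, 1]")
    scored = [
        (pseudo_log_likelihood(toks, scorer) / max(len(toks), 1), toks)
        for toks in candidates
    ]
    # Highest (least negative) score first; low-PLL sentences fall to the end.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = math.floor(keep_rate * len(scored))
    return [toks for _, toks in scored[:n_keep]]


def toy_scorer(tokens: Sequence[str], i: int) -> float:
    """Stand-in for a masked LM: known words are likely, unknown ones are not."""
    vocab = {"the": 0.5, "president": 0.2, "of": 0.5, "france": 0.2}
    return math.log(vocab.get(tokens[i].lower(), 0.001))


generated = [
    ["the", "president", "of", "france"],
    ["xq", "zz", "of", "qq"],
    ["the", "of"],
    ["zz", "the"],
]
silver = confidence_filter(generated, toy_scorer, keep_rate=0.5)
```

In practice the scorer would wrap a real masked language model, and `keep_rate` would be tuned per dataset in the way the parameter search in Figure 4 suggests.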