Small Language Model Makes an Effective Long Text Extractor

Yelin Chen, Fanjin Zhang, Jie Tang

TL;DR

This work tackles named entity recognition on long texts, where entities can form long spans that cross sentence boundaries. It introduces SeNER, a lightweight span-based model that combines a bidirectional arrow attention encoder with LogN-Scaling on the [CLS] token and a BiSPA-based token-pair interaction module to capture global and local context efficiently. In experiments on three long-NER datasets, SeNER achieves state-of-the-art accuracy while remaining GPU-memory efficient and handling substantially longer inputs than prior span-based methods. Ablation studies and further analyses support the contribution of arrow attention, BiSPA, LogN-Scaling, and the training strategies (low-rank adaptation, LoRA, and whole word masking, WWM) to robustness and efficiency. Overall, the approach offers a practical and scalable solution to long-text information extraction.

Abstract

Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: https://github.com/THUDM/scholar-profiling/tree/main/sener
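
To make the memory argument in the abstract concrete, the sketch below counts candidate token-pair spans for a text of n tokens: first when every (start, end) pair is enumerated, then when only spans up to a fixed length are kept. The length cap is an illustrative assumption, not the BiSPA mechanism itself, and the function names and the cap of 128 tokens are hypothetical.

```python
def count_all_spans(n: int) -> int:
    """Number of (start, end) token pairs with start <= end among n tokens."""
    return n * (n + 1) // 2


def count_windowed_spans(n: int, max_len: int) -> int:
    """Number of spans whose length (end - start + 1) is at most max_len.

    Illustrative restriction only; it is not the paper's BiSPA mechanism.
    """
    w = min(max_len, n)
    # For each length k in 1..w there are n - k + 1 spans; summing gives:
    return w * n - w * (w - 1) // 2


if __name__ == "__main__":
    # Hypothetical input lengths and length cap, chosen only for illustration.
    for n in (512, 2048, 8192):
        full = count_all_spans(n)
        capped = count_windowed_spans(n, max_len=128)
        print(f"n={n}: all spans={full}, spans of length <= 128: {capped}")
```

Full enumeration grows quadratically with input length, while the capped count grows roughly linearly, which is the kind of reduction a span-based method needs in order to scale to long inputs.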

Paper Structure

This paper contains 28 sections, 9 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: An example of entity/attribute extraction from an author's homepage, where "work experience" is a long entity block and "award" is a long entity.
  • Figure 2: Performance of entity recognition with respect to input length and the number of model parameters of NER methods.
  • Figure 3: An overview of the SeNER model.
  • Figure 4: Illustration of arrow attention, full attention, and sliding window attention.
  • Figure 5: Diagram of the transformation for the token-pair span tensors in the BiSPA mechanism.
  • ...and 3 more figures