A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950

Zhao Fang, Liang-Chun Wu, Xuening Kong, Spencer Dean Stewart

TL;DR

The paper addresses NLP on historical Chinese texts (1900–1950) by comparing large language models (LLMs) with traditional tools on word segmentation, POS tagging, and NER, evaluated against a ground-truth sample drawn from the Shanghai Library Republican Journal corpus. It finds that LLMs achieve higher accuracy across all three tasks but at higher computational cost, and that they cope better with poetry and with temporal variation (pre-1920 versus post-1920 texts). The study provides evidence that contextual learning in LLMs can advance historical Chinese NLP without extensive domain-specific training data, and it discusses the trade-offs and the potential for hybrid approaches. Overall, the work informs digital humanities workflows by highlighting the promise of LLM-based approaches for historical corpora while outlining practical considerations for efficiency and reproducibility.

Abstract

This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.
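To make the comparison concrete, the following is a minimal sketch of how a word-segmentation output could be scored against a ground-truth segmentation with precision, recall, and F1. It is an illustration only: the function names `to_spans` and `segmentation_f1` and the sample sentence are hypothetical, and the paper's own evaluation procedure may differ.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, pred_words):
    """Span-level F1: a predicted word counts as correct only if both of
    its boundaries match a word in the gold segmentation."""
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("both segmentations must cover the same text")
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction wrongly splits one gold word in two.
gold = ["上海", "图书馆", "收藏", "期刊"]
pred = ["上海", "图书", "馆", "收藏", "期刊"]
score = segmentation_f1(gold, pred)
print(f"F1 = {score:.3f}")
```

Here three of the five predicted words match the four gold words, so precision is 3/5 and recall is 3/4. The same span-matching idea extends to POS tagging and NER by attaching a label to each span.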

Paper Structure

This paper contains 8 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Boxplot of F1 Score for Temporal Change by Model for non-poetry texts, capturing the median (line), interquartile range (boxes), and spread of data (whiskers).