Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
TL;DR
This work reframes word segmentation as a probe of the semantic understanding of Large Language Models (LLMs) and introduces the LLM-Word Segmentation (LLM-WS) framework for unsupervised segmentation of languages that lack explicit word boundaries. It then presents LLACA, which combines a dynamic Aho-Corasick automaton built from LLM-derived vocabularies with a variable $n$-gram model and Viterbi decoding, and applies pointwise mutual information (PMI) filtering to reduce the effect of hallucinated words. Empirical results show that larger LLMs segment better and that LLACA is faster and more robust, often surpassing direct LLM outputs at lower computational cost. The study covers multilingual data (Chinese, Japanese, Korean, Thai) and includes an analysis of out-of-vocabulary (OOV) handling, highlighting practical benefits for domain adaptation and downstream NLP tasks. Overall, the paper argues for a shift from a segment-first to a comprehend-first paradigm in NLP and provides a scalable, domain-adaptable framework for unsupervised word segmentation.
Abstract
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and to evaluate the semantic understanding capabilities of LLMs through word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages as a way to assess their "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words, and that models with more parameters tend to perform better across multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA ($\textbf{L}$arge $\textbf{L}$anguage Model-Inspired $\textbf{A}$ho-$\textbf{C}$orasick $\textbf{A}$utomaton). LLACA combines the pattern-matching capabilities of Aho-Corasick automata with the knowledge of well-pretrained LLMs. This approach enables the construction of a dynamic $n$-gram model that adjusts based on contextual information while integrating the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
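
To make the decoding idea concrete, here is a minimal sketch of vocabulary-based Viterbi segmentation. It is not the authors' implementation: it assumes a fixed unigram model in place of the dynamic variable $n$-gram model, uses a plain dictionary lookup where LLACA uses an Aho-Corasick automaton, and the names `viterbi_segment`, `word_logprob`, `max_len`, and `unk_logprob` are hypothetical. The toy vocabulary and its probabilities are invented for illustration only.

```python
import math


def viterbi_segment(text, word_logprob, max_len=4, unk_logprob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    word_logprob: dict mapping known words to log-probabilities (assumed input,
                  e.g. estimated from an LLM-derived vocabulary).
    max_len:      longest candidate word considered, in characters.
    unk_logprob:  penalty for a single character that is not in the vocabulary.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word ending at i
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_logprob:
                score = word_logprob[piece]
            elif end - start == 1:
                score = unk_logprob  # unknown single character as a fallback
            else:
                continue  # unknown multi-character strings are not candidates
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy vocabulary with invented probabilities (illustration only).
toy_vocab = {
    "研究": math.log(0.3),
    "生命": math.log(0.2),
    "研究生": math.log(0.1),
    "命": math.log(0.05),
    "起源": math.log(0.2),
}
print(viterbi_segment("研究生命起源", toy_vocab))
```

In this toy example the decoder prefers 研究 / 生命 / 起源 over 研究生 / 命 / 起源 because the product of the word probabilities is higher. In a full system the dictionary scan in the inner loop would be replaced by automaton matching, and the per-word score would depend on the preceding context.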
