NeoN: A Tool for Automated Detection, Linguistic and LLM-Driven Analysis of Neologisms in Polish

Aleksandra Tomaszewska, Dariusz Czerski, Bartosz Żuk, Maciej Ogrodniczuk

TL;DR

NeoN tackles the challenge of Polish neologism detection by moving beyond dictionary-based methods to an RSS-driven, multi-layered pipeline that fuses corpus filtering with an LLM-based precision boost. It combines four high-quality Polish corpora, language-specific form filtering and grouping, and an integrated LLM module for automatic definitions and multidimensional categorization by domain and sentiment, all accessible through a user-friendly interface. Empirical results show substantial gains in precision with limited manual effort, while lemmatization and definition-generation experiments demonstrate the value of specialized tools and LLMs for Polish morphology and semantics. The work enables scalable, real-time tracking of lexical innovation in Polish and outlines a path toward open benchmarks, user-supplied corpora, and more efficient models.

Abstract

NeoN is a tool for detecting and analyzing Polish neologisms. Unlike traditional dictionary-based methods, which require extensive manual review, NeoN combines reference corpora, Polish-specific linguistic filters, an LLM-driven precision-boosting filter, and daily RSS monitoring in a multi-layered pipeline. The system uses context-aware lemmatization, frequency analysis, and orthographic normalization to extract candidate neologisms while consolidating inflectional variants. Researchers can verify candidates through an intuitive interface with visualizations and filtering controls. An integrated LLM module automatically generates definitions and categorizes neologisms by domain and sentiment. Evaluations show NeoN maintains high accuracy while significantly reducing manual effort, providing an accessible solution for tracking lexical innovation in Polish.
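As a rough illustration of the candidate-extraction idea described in the abstract (orthographic normalization, filtering against a reference vocabulary, grouping inflectional variants under a lemma, and a frequency threshold), the following minimal Python sketch may help. It is not NeoN's implementation: the function names, the `min_freq` parameter, the toy lemma table, and the tiny reference vocabulary are all hypothetical, and a real system would presumably use a context-aware Polish lemmatizer and full reference corpora instead.

```python
import re
from collections import Counter, defaultdict


def normalize(token: str) -> str:
    """Toy orthographic normalization: lowercase, strip edge punctuation."""
    return re.sub(r"^\W+|\W+$", "", token.lower())


def find_candidates(texts, reference_vocab, lemmatize, min_freq=2):
    """Return {lemma: (frequency, sorted surface forms)} for unknown words.

    reference_vocab stands in for the reference corpora; lemmatize is any
    callable mapping a surface form to its lemma (hypothetical here).
    """
    counts = Counter()
    variants = defaultdict(set)
    for text in texts:
        for raw in text.split():
            form = normalize(raw)
            if not form or not form.isalpha():
                continue  # skip empty strings, numbers, mixed tokens
            lemma = lemmatize(form)
            if form in reference_vocab or lemma in reference_vocab:
                continue  # already attested, so not a candidate
            counts[lemma] += 1
            variants[lemma].add(form)
    return {
        lemma: (n, sorted(variants[lemma]))
        for lemma, n in counts.items()
        if n >= min_freq
    }


# Hypothetical toy data: a hand-written lemma table and a tiny vocabulary.
toy_lemmas = {"tiktokera": "tiktoker", "tiktokerzy": "tiktoker"}
reference_vocab = {"znany", "nagrał", "film", "obejrzeli", "inni"}
texts = [
    "Znany tiktoker nagrał film.",
    "Film tiktokera obejrzeli inni tiktokerzy.",
]

result = find_candidates(
    texts, reference_vocab, lambda form: toy_lemmas.get(form, form)
)
print(result)
```

Grouping by lemma before counting is what lets the three inflected forms contribute to a single candidate instead of three rare ones, which matters for a highly inflected language such as Polish.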

Paper Structure

This paper contains 20 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: NeoN: overview of the monitoring interface.
  • Figure 2: Accuracy of DeepSeek-R1 and Llama-70B in pointwise evaluation across three prompting setups.
  • Figure 3: Win rate of DeepSeek-R1 and Llama-70B in pairwise evaluation against human-written definitions across three prompting setups.
  • Figure 4: Results of pointwise meta-evaluation for three human annotators and GPT-4o (LLM judge) across two judged models: Llama-70B and DeepSeek-R1.
  • Figure 5: Results of pairwise meta-evaluation for three human annotators and GPT-4o (LLM judge) across two judged models: Llama-70B and DeepSeek-R1.