Slovak Conceptual Dictionary

Miroslav Blšták

TL;DR

This paper presents a pioneering Slovak conceptual dictionary designed to overcome the lack of machine-readable semantic resources for Slovak NLP. It describes a taxonomy-based data model with concepts, categories, and relationships, enriched with POS, gender, multiword expressions, and translations, and built atop Eduself.sk data with iterative/manual curation. The resource, accessible via web and API, comprises over 145k concepts and 355k relationships, and is shown to support information extraction, coreference resolution, semantic similarity, and bias-aware dataset generation, with demonstrated utility in migration-related text analysis. The work establishes the dictionary as the largest Slovak semantic resource to date and outlines future API enhancements and data/tool expansions to broaden applicability in NLP and education.

Abstract

Natural language processing tasks sometimes require dictionary tools such as lexicons, word-form dictionaries, or knowledge bases. However, such dictionary data is insufficiently available for many languages, especially low-resource ones. In this article, we introduce a new conceptual dictionary for the Slovak language, the first linguistic tool of this kind. Because Slovak has limited linguistic resources and no machine-readable linguistic data source of sufficiently large volume is currently available, many tasks that require automated processing of Slovak text achieve weaker results than in other languages or are almost impossible to solve.

Paper Structure

This paper contains 6 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Example of the data structure (concepts in a taxonomy structure and relationships between concepts).
  • Figure 2: Example of a conceptual network.
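
To make the data model named in the TL;DR and in Figure 1 more concrete, the following is a minimal sketch of concepts arranged in a taxonomy with typed relationships between them. It is not the dictionary's actual schema or API: the class names, field names, the `ancestors` helper, and the sample entries are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One dictionary entry. All field names are assumptions, not the real schema."""
    name: str
    pos: Optional[str] = None        # part of speech, e.g. "noun"
    gender: Optional[str] = None     # grammatical gender, e.g. "m"
    parent: Optional[str] = None     # broader concept (category) in the taxonomy
    translations: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Relationship:
    """A typed, directed link between two concepts (hypothetical shape)."""
    source: str
    kind: str
    target: str


def ancestors(concepts: Dict[str, Concept], name: str) -> List[str]:
    """Walk the taxonomy upward from `name`, nearest category first."""
    chain: List[str] = []
    seen = {name}                    # guards against accidental cycles
    current = concepts[name].parent
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        node = concepts.get(current)
        current = node.parent if node is not None else None
    return chain


# Invented sample data, not taken from the dictionary itself.
concepts = {
    "zviera": Concept("zviera", pos="noun", gender="n",
                      translations={"en": "animal"}),
    "cicavec": Concept("cicavec", pos="noun", gender="m", parent="zviera",
                       translations={"en": "mammal"}),
    "pes": Concept("pes", pos="noun", gender="m", parent="cicavec",
                   translations={"en": "dog"}),
}
relationships = [Relationship("pes", "has_part", "chvost")]

path = ancestors(concepts, "pes")
```

Storing only a `parent` pointer per concept keeps the taxonomy a simple tree, while the separate `Relationship` list can hold the non-hierarchical links that turn it into a conceptual network of the kind shown in Figure 2.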