J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema

Kosuke Matsuzaki, Masaya Taniguchi, Kentaro Inui, Keisuke Sakaguchi

TL;DR

J-UniMorph is a Japanese morphology dataset built on the UniMorph feature schema. It covers the rich verb forms that arise from the language's agglutinative nature and is compared with the Wiktionary Edition.

Abstract

We introduce J-UniMorph, a Japanese morphology dataset developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph differs from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is dominated by denominal verbs (i.e., [noun] + suru (do-PRS)). Morphologically, this form is equivalent to the verb suru (do). In contrast, J-UniMorph covers a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We make J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.

Paper Structure (35 sections, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Overview of the J-UniMorph creation process. First, we generate inflected forms from seed verbs (Table A, detailed in §\ref{subsec:criteria}) and inflection suffixes (Table B, detailed in §\ref{subsec:generate}) using the verb inflection tool kamiya-codec. We then modify and add inflected forms that the tool does not cover (Table C, detailed in §\ref{subsec:generate}). Second, native Japanese speakers annotate each form with UniMorph labels (Table D, detailed in §\ref{sec:feature}). Finally, we apply a frequency filter to discard infrequent inflected forms (Table E, detailed in §\ref{subsec:filtering}).
  • Figure 2: The relationship between the frequency rank of inflected forms and their number of Google search hits, shown separately for J-UniMorph and the Wiktionary Edition. Both graphs follow a long-tail distribution and exhibit a clear trend shift when the number of hits falls to $10^1$ or fewer. After manual review by the authors, we concluded that the J-UniMorph forms in this range sound unnatural and should be discarded (plotted in light blue), leaving a total of 12,687 inflected forms in J-UniMorph. Additionally, inflected forms in the Wiktionary Edition have fewer hits than those in J-UniMorph (detailed in §\ref{subsec:4comparison}).
  • Figure 3: Screenshot of the J-UniMorph Visualizer, a tool for helping learners of Japanese. Users input an inflected form and click the "Search" button to highlight the corresponding UniMorph labels. If the inflected form has multiple meanings, they are displayed under the "Search Results" section, with the option to toggle between meanings. Additionally, the "Related Words" section displays other inflected forms with the same label (including the queried form itself). Confidence values, ranging from 0 to 100 and based on Google search hits, help users determine which inflected form to use; higher values indicate more hits. Users can also switch between labels to investigate inflected forms with different meanings.
  • Figure 4: Correspondence between the basic forms and the lexical honorifics adopted in J-UniMorph.
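
The frequency filter mentioned in the captions of Figures 1 and 2 can be illustrated with a minimal Python sketch. It assumes a plain cutoff at 10 search hits, matching the $10^1$ trend shift in Figure 2; the authors' actual decision also involved manual review, so this is not their implementation. The `InflectedForm` record, the `frequency_filter` function, and the hit counts in the example are hypothetical and used only for illustration.

```python
from typing import List, NamedTuple


class InflectedForm(NamedTuple):
    """Hypothetical record for one generated inflected form."""
    lemma: str  # dictionary form of the verb
    form: str   # generated inflected form
    hits: int   # number of search hits observed for the form


# Assumption: forms with 10^1 or fewer hits are treated as unnatural.
HIT_THRESHOLD = 10


def frequency_filter(forms: List[InflectedForm],
                     threshold: int = HIT_THRESHOLD) -> List[InflectedForm]:
    """Keep only the forms whose hit count is strictly above the threshold."""
    return [f for f in forms if f.hits > threshold]


# Made-up hit counts, purely illustrative.
candidates = [
    InflectedForm("食べる", "食べた", 1_000_000),
    InflectedForm("食べる", "食べさせられたがらない", 3),
    InflectedForm("食べる", "食べません", 10),
]

kept = frequency_filter(candidates)
for f in kept:
    print(f.lemma, f.form, f.hits)
```

With the strict comparison, a form with exactly 10 hits is discarded, which follows the "$10^1$ or fewer" wording of the Figure 2 caption.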