LLMs' morphological analyses of complex FST-generated Finnish words

Anssi Moisio; Mathias Creutz; Mikko Kurimo

TL;DR

The study probes whether state-of-the-art LLMs learn human-like Finnish morphology by posing explicit morphological classification tasks over a large, FST-generated inflection set. Using Omorfi to create ~25M noun forms and testing on a 2000-form sample, the authors evaluate GPT-4-turbo, GPT-3.5-turbo, Llama2-70B, Poro-34B, and RNN baselines under 0/1/5/10-shot prompts. Results show GPT-4-turbo has limited systematic morpho-grammatical knowledge, while smaller LLMs perform poorly; in contrast, simple RNNs trained on substantial data outperform these LLMs on this task. The findings suggest that, despite advanced generation capabilities, LLMs rely on heuristics rather than fully human-like grammatical generalization for complex Finnish morphology, highlighting a gap between surface competence and explicit morphosyntactic understanding. The work underscores the value of targeted, explicit evaluation datasets and points toward integrating explicit grammar representations to improve morphological generalization in LLMs.

Abstract

Rule-based language processing systems have been overshadowed by neural systems in terms of utility, but it remains unclear whether neural NLP systems, in practice, learn the grammar rules that humans use. This work aims to shed light on the issue by evaluating state-of-the-art LLMs on a morphological analysis task over complex Finnish noun forms. We generate the forms using an FST tool; because they are unlikely to have occurred in the training sets of the LLMs, analysing them requires morphological generalisation capacity. We find that GPT-4-turbo has some difficulties in the task, GPT-3.5-turbo struggles, and the smaller models Llama2-70B and Poro-34B fail nearly completely.
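The evaluation setup described above (k-shot prompts for classifying the morphological features of a noun form, scored against FST-derived gold labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording, the label names, the helper names `build_prompt` and `score`, and the three example words are all assumptions chosen for readability and are not taken from the paper's dataset.

```python
from collections import Counter

# Hypothetical labelled examples (illustrative only, not from the paper's data).
EXAMPLES = [
    ("talossani", {"case": "inessive", "number": "singular", "possessive": "1st singular"}),
    ("taloissamme", {"case": "inessive", "number": "plural", "possessive": "1st plural"}),
    ("kirjastasi", {"case": "elative", "number": "singular", "possessive": "2nd singular"}),
]


def build_prompt(word, feature, shots):
    """Build a k-shot classification prompt for one morphological feature.

    `shots` is a list of (form, labels) pairs; an empty list gives a 0-shot prompt.
    The instruction wording is an assumption, not the paper's actual prompt.
    """
    blocks = [f"Classify the {feature} of the Finnish noun form."]
    for form, labels in shots:
        blocks.append(f"Word: {form}\nAnswer: {labels[feature]}")
    blocks.append(f"Word: {word}\nAnswer:")
    return "\n\n".join(blocks)


def score(gold, predicted):
    """Return accuracy and a Counter of (gold, predicted) label pairs."""
    pairs = list(zip(gold, predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return accuracy, Counter(pairs)
```

The `Counter` of (gold, predicted) pairs holds the same information as a confusion matrix of the kind listed among the paper's figures, so per-label confusions can be read directly from it.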

Paper Structure (11 sections, 13 figures, 5 tables)

Do neural networks learn grammar?
Data and methods
Results
Discussion
Reasons behind the errors
Interpretations and implications
Conclusion
Limitations
Acknowledgements
Details of the classification task
Detailed results

Figures (13)

Figure 1: Results in the morphological analysis task.
Figure 2: Case label confusions of GPT-4-turbo in the 0-shot and 10-shot setups. See the appendix "Detailed results" for all confusion matrices.
Figure 3: Possessive suffix label confusions of GPT-4-turbo in the 0-shot and 10-shot setups. See the appendix "Detailed results" for all confusion matrices.
Figure 4: Confusions in the GPT-4-turbo and GPT-3.5-turbo number classification task.
Figure 5: Confusions in the Llama2-70B and Poro-34B number classification task.
...and 8 more figures
