Annotating Compositionality Scores for Irish Noun Compounds is Hard Work
Abigail Walsh, Teresa Clifford, Emma Daly, Jane Dunne, Brian Davis, Gearóid Ó Cleircín
TL;DR
Irish noun compounds (NCs) pose significant NLP challenges because their compositionality varies. The authors construct a corpus of Irish NCs from diverse domains, annotated for compositionality, domain specificity, annotator familiarity, and annotator confidence, using a two-word NCC constraint and PARSEME-inspired guidelines. They draw data from the Dúchas folklore collection and the UD-IDT, publish annotation guidelines, and provide pilot annotations along with preliminary analyses that reveal differences between the data sources and show the impact of domain knowledge and annotator familiarity. This work establishes Irish-specific NC resources and methodologies to support robust NLP processing and the evaluation of multilingual models, including LLMs, on Irish NC interpretation.
Abstract
Noun compounds constitute a challenging construction for NLP applications, given their variability in idiomaticity and interpretation. In this paper, we present an analysis of noun compounds identified by expert annotators in Irish text from varied domains. The analysis focuses on compositionality as a key feature, but also considers domain specificity, as well as the familiarity and confidence of the annotator giving the ratings. Our findings and the discussion that ensued contribute towards a greater understanding of how these constructions appear in the Irish language, and how they might be treated separately from English noun compounds.
