Human and Automatic Interpretation of Romanian Noun Compounds

Ioana Marinescu, Christiane Fellbaum

TL;DR

This work investigates how Romanian noun compounds are interpreted by humans and machines, addressing morphosyntactic differences from English and the limitations of existing semantic-relations inventories. It introduces a novel taxonomy with 16 labeled relations plus a 'none' category and trains a neural classifier on Romanian BERT embeddings to map compounds to these relations. Human judgments collected via Mechanical Turk reveal substantial disagreement, while the neural model achieves 68% alignment with human labels on a held-out set, though the prevalence of the 'none' category highlights gaps in the taxonomy. The findings suggest that a richer, language-aware set of semantic relations is needed to reliably interpret noun compounds across languages, with implications for translation and information access in less-studied languages. The study also provides a dataset and methodology to spur cross-linguistic semantic analysis of noun compounds.

Abstract

Determining the intended, context-dependent meanings of noun compounds like "shoe sale" and "fire sale" remains a challenge for NLP. Previous work has relied on inventories of semantic relations that capture the different meanings between compound members. Focusing on Romanian compounds, whose morphosyntax differs from that of their English counterparts, we propose a new set of relations and test it with human annotators and a neural net classifier. Results show an alignment of the network's predictions and human judgments, even where the human agreement rate is low. Agreement tracks with the frequency of the selected relations, regardless of structural differences. However, the most frequently selected relation was none of the sixteen labeled semantic relations, indicating the need for a better relation inventory.

Paper Structure

This paper contains 11 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Frequency of selection and frequency of agreement. The most frequently selected categories show the highest inter-annotator agreement.
  • Figure 2: Average frequency (across all noun compounds in the Romanian UD treebank) of the heads and modifiers, respectively, of the compounds in each category, as a function of how many times that category was selected by the annotators.