The aftermath of compounds: Investigating Compounds and their Semantic Representations

Swarang Joshi

TL;DR

The paper investigates how well static (GloVe) and contextual (BERT) embeddings reflect human judgments of compound semantics, focusing on lexeme meaning dominance ($LMD$) and semantic transparency ($ST$). It uses a 628-item dataset annotated for these measures and augments it with association strength from the Edinburgh Associative Thesaurus (EAT), predictability from LaDEC, and frequency from the British National Corpus (BNC). Writing $v_c$, $v_l$, and $v_r$ for the embeddings of the compound, its left constituent, and its right constituent, the embedding-derived scores are computed with the cosine-based formulas $LMD = |\cos(v_c, v_l) - \cos(v_c, v_r)| \times 4 + 5$ and $ST = \frac{\cos(v_c, v_l) + \cos(v_c, v_r)}{2} \times 3.5$. These scores are evaluated against the human ratings with Spearman correlations and mean absolute error (MAE), and regressors are trained to predict $LMD$ and $ST$ from association, frequency, and predictability. The results show that BERT aligns more closely with human judgments than GloVe, and that predictability is the strongest predictor of $ST$ for both humans and models. The paper presents these findings as guidance for embedding selection and feature integration in computational psycholinguistics and as support for dual-route theories of compound processing.
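The two cosine-based formulas in the TL;DR can be illustrated with a minimal sketch. The function names (`cosine`, `lmd_score`, `st_score`) and the toy two-dimensional vectors are assumptions made for illustration only; they are not taken from the paper's code, and real GloVe or BERT vectors would have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def lmd_score(v_c, v_l, v_r):
    """LMD = |cos(v_c, v_l) - cos(v_c, v_r)| * 4 + 5, as stated in the TL;DR."""
    return abs(cosine(v_c, v_l) - cosine(v_c, v_r)) * 4 + 5


def st_score(v_c, v_l, v_r):
    """ST = (cos(v_c, v_l) + cos(v_c, v_r)) / 2 * 3.5, as stated in the TL;DR."""
    return (cosine(v_c, v_l) + cosine(v_c, v_r)) / 2 * 3.5


# Toy vectors (hypothetical): the compound points exactly along its left
# constituent and is orthogonal to its right constituent.
v_compound = [1.0, 0.0]
v_left = [1.0, 0.0]
v_right = [0.0, 1.0]

lmd = lmd_score(v_compound, v_left, v_right)
st = st_score(v_compound, v_left, v_right)
```

In this toy case the two cosines are 1 and 0, so the difference is maximal for non-negative similarities and the average similarity is 0.5. Because the LMD formula takes an absolute value, the score on its own does not indicate which constituent dominates.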

Abstract

This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearman's correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
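The comparison between embedding-derived scores and human ratings relies on Spearman's correlation, with MAE also mentioned in the TL;DR. The sketch below shows one way these two measures could be computed in plain Python. The helper names and the toy ratings are hypothetical and are not data from the paper; in practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(predicted, observed):
    """Mean of the absolute differences between paired values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical ratings for four compounds (illustrative only).
human_st = [6.1, 2.3, 4.8, 5.5]
model_st = [3.0, 1.2, 2.1, 2.9]

rho = spearman(human_st, model_st)
mae = mean_absolute_error(model_st, human_st)
```

Spearman's correlation depends only on rank order, so it is unaffected by the scaling constants in the LMD and ST formulas, whereas MAE is sensitive to them. That difference may be one reason to report both.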

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Compound type distribution in dataset (n=628): 68% endocentric, 31% exocentric, <1% copulative.
  • Figure 2: Compound metrics heatmap. TRAN refers to semantic transparency (ST).
  • Figure 3: Performance of Regressors
  • Figure 4: BERT vs GloVe LMD distribution
  • Figure 5: BERT vs GloVe LMD distribution