Dispersion Measures as Predictors of Lexical Decision Time, Word Familiarity, and Lexical Complexity

Adam Nohejl; Taro Watanabe

TL;DR

This study tackles the external validation of dispersion measures as predictors of lexical processing and word familiarity across five languages, using three tasks and three corpus granularities. It systematically evaluates a suite of dispersion metrics, applying log-transformations and smoothing, and compares them to log-frequency via adjusted $R^2$ in single- and two-predictor models on 11 datasets from the TUBELEX corpus. The results show that the logarithm of range, $\log R$, is the most robust dispersion predictor across tasks and languages, particularly at fine-grained granularities (videos and channels), and that incorporating $\log R$ with log-frequency yields reliable improvements for several measures. These findings provide practical guidelines for selecting dispersion features in lexical modeling and NLP applications, clarifying when and how dispersion adds predictive value beyond frequency.

Abstract

Various measures of dispersion have been proposed to paint a fuller picture of a word's distribution in a corpus, but little has been done to validate them externally. We evaluate a wide range of dispersion measures as predictors of lexical decision time, word familiarity, and lexical complexity in five diverse languages. We find that the logarithm of range is not only a better predictor than log-frequency across all tasks and languages, but also the most powerful variable to add to log-frequency, consistently outperforming the more complex dispersion measures. We discuss the effects of corpus part granularity and logarithmic transformation, shedding light on the contradictory results of previous studies.
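As an illustration of the comparison described above, the following minimal sketch computes frequency and range (the number of corpus parts in which a word occurs) and compares adjusted $R^2$ for log-frequency alone, log-range alone, and both together. It is not the authors' code: the toy corpus, the `toy_score` values, and all function names are invented for illustration, and the least-squares fit is a plain normal-equations solver rather than any particular statistics package. The base of the logarithm does not affect $R^2$, so the natural logarithm is used.

```python
import math
from collections import Counter


def frequency_and_range(parts):
    """Count total occurrences and the number of parts containing each word.

    `parts` is a list of token lists, e.g. one list per video or channel.
    """
    freq, rng = Counter(), Counter()
    for tokens in parts:
        freq.update(tokens)
        rng.update(set(tokens))  # a part counts at most once per word
    return freq, rng


def ols_fit(X, y):
    """Ordinary least squares with intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix (A^T A | A^T y)
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(k):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def adjusted_r2(X, y):
    """Adjusted R^2 of a linear model with len(X[0]) predictors."""
    n, p = len(y), len(X[0])
    beta = ols_fit(X, y)
    pred = [beta[0] + sum(b * x for b, x in zip(beta[1:], r)) for r in X]
    mean_y = sum(y) / n
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Toy corpus: four "parts" (hypothetical; stands in for videos or channels).
parts = [
    ["the", "cat", "sat", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "cat", "cat", "cat"],
    ["a", "dog", "the", "the"],
]
# Invented target values standing in for, e.g., familiarity ratings.
toy_score = {"the": 6.8, "cat": 5.1, "sat": 4.9, "mat": 3.0, "dog": 5.3, "a": 3.4}

freq, rng = frequency_and_range(parts)
words = sorted(toy_score)
y = [toy_score[w] for w in words]
log_f = [[math.log(freq[w])] for w in words]
log_r = [[math.log(rng[w])] for w in words]
both = [[math.log(freq[w]), math.log(rng[w])] for w in words]

print("log-frequency only:", adjusted_r2(log_f, y))
print("log-range only:    ", adjusted_r2(log_r, y))
print("both predictors:   ", adjusted_r2(both, y))
```

With real data, the same comparison would be repeated per dataset and per part granularity; the numbers printed for this toy example carry no empirical meaning.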

Paper Structure (7 sections, 3 equations, 1 figure, 1 table)

This paper contains 7 sections, 3 equations, 1 figure, 1 table.

Introduction
Related Research
Examined Measures
Evaluation
To Log or Not to Log
Results
Discussion

Figures (1)

Figure 1: Mean $R_\textrm{a}^2$ computed over 11 datasets for each dispersion measure, part granularity, and prediction with/without log-frequency as a second variable, where "(log)" indicates log-transformed measures. Stars indicate robust predictors, namely: $\bigstar$ single predictors that were not significantly ($p < 0.001$) worse than log-frequency for any dataset, and $\bigstar$ predictors that, when used with log-frequency, improved the prediction by $\Delta R_\textrm{a}^2 \geq 0.01$ for at least 8 of 11 datasets.
