Estimating Lexical Complexity from Document-Level Distributions

Sondre Wold, Petter Mæhlum, Oddbjørn Hove

TL;DR

The paper addresses lexical complexity estimation for short health-assessment texts, proposing a two-step, annotation-free approach that leverages document-level distributions. It first validates LIX as a discriminative document-level complexity measure across Norwegian corpora, then derives each lemma's complexity as the median document-level score of the documents in which the lemma occurs, normalized to reflect exposure and frequency. The authors illustrate cross-language applicability with a Coleman-Liau-type index for English, and they provide qualitative evidence that substitutions based on complexity scores can reduce cognitive load in domain-specific tools. They discuss limitations, such as the handling of multi-word expressions and the need for patient feedback, and outline future work toward broader language coverage and integration with language models to support practical health-care applications.
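
The two steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses the standard LIX definition (average sentence length plus the percentage of words longer than six letters), and the function names `lix` and `lemma_complexity`, the regex-based sentence and word splitting, and the use of lowercased tokens as a stand-in for lemmas are all assumptions made for the example. The normalization step mentioned in the summary is not specified here and is therefore left out.

```python
import re
from collections import defaultdict
from statistics import median


def lix(text: str) -> float:
    """Standard LIX: words per sentence + percentage of words longer than 6 letters."""
    # Assumption: sentences end at '.', '!' or '?'; a real pipeline would use a proper splitter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not words:
        raise ValueError("text must contain at least one word")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


def lemma_complexity(documents: list[str]) -> dict[str, float]:
    """Map each lemma to the median LIX score of the documents it occurs in."""
    scores_per_lemma = defaultdict(list)
    for doc in documents:
        doc_score = lix(doc)
        # Assumption: lowercased tokens stand in for lemmas (no lemmatizer is used here).
        # A set is used so that each document contributes its score once per lemma.
        for lemma in {w.lower() for w in re.findall(r"\w+", doc)}:
            scores_per_lemma[lemma].append(doc_score)
    return {lemma: median(vals) for lemma, vals in scores_per_lemma.items()}


if __name__ == "__main__":
    corpus = [
        "The cat sat. The dog ran.",
        "Parliamentary legislation regulates municipalities extensively.",
    ]
    for lemma, score in sorted(lemma_complexity(corpus).items()):
        print(f"{lemma}: {score:.1f}")
```

Using the median rather than the mean keeps a lemma's score from being dominated by a few unusually simple or unusually complex documents, which matches the summary's description of the score as a property of the document-level distribution.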

Abstract

Existing methods for complexity estimation are typically developed for entire documents. This limitation in scope makes them inapplicable for shorter pieces of text, such as health assessment tools. These typically consist of lists of independent sentences, all of which are too short for existing methods to apply. The choice of wording in these assessment tools is crucial, as both the cognitive capacity and the linguistic competency of the intended patient groups could vary substantially. As a first step towards creating better tools for supporting health practitioners, we develop a two-step approach for estimating lexical complexity that does not rely on any pre-annotated data. We implement our approach for the Norwegian language and verify its effectiveness using statistical testing and a qualitative evaluation of samples from real assessment tools. We also investigate the relationship between our complexity measure and certain features typically associated with complexity in the literature, such as word length, frequency, and the number of syllables.

Paper Structure (30 sections, 3 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: The distribution of LIX scores of texts from four different corpora. From left to right: children's books, news articles, encyclopedia entries, and legislative texts from the Norwegian parliament.
  • Figure 2: Normalised LIX scores for all content words in our corpus ($n=64\,071$).
  • Figure 3: The relationship between frequency and complexity score after normalization for a selection of lemmas.
  • Figure 4: Relationship between complexity and word-level features, from a sample of 10 000 lemmas.

Theorems & Definitions (4)

  • Example 4.1
  • Example 4.2
  • Example 4.3
  • Example 4.4