Unsupervised extraction of local and global keywords from a single text

Lida Aleksanyan, Armen E. Allahverdyan

TL;DR

This work introduces a corpus-independent, unsupervised framework for extracting keywords from a single text by analyzing the spatial distribution of words and its sensitivity to random permutations. By comparing the second moment of the word-gap distribution before and after permutation, the method identifies global keywords (spread across the text) and local keywords (clustered in specific parts), and it demonstrates topic extraction capabilities. Across several long literary works, the approach yields higher precision and recall than LUHN, YAKE, and related baselines, and it robustly transfers across English, Russian, and French, while also enabling a chapter-based alternative that relates keywords to discourse structure. The findings have implications for discourse analysis and topic extraction in literature and can be extended to n-grams and co-occurrence analyses, with future work aimed at improving models of random text and applying the method to shorter texts.

Abstract

We propose an unsupervised, corpus-independent method to extract keywords from a single text. It is based on the spatial distribution of words and on the response of this distribution to a random permutation of words. Compared with existing methods (such as YAKE), our method has three advantages. First, it is significantly more effective at extracting keywords from long texts. Second, it allows inference of two types of keywords: local and global. Third, it uncovers basic themes in texts. Additionally, our method is language-independent and applies to short texts. The results are obtained via human annotators with prior knowledge of the texts from our database of classical literary works (the agreement between annotators ranges from moderate to substantial). Our results are supported by human-independent arguments based on the average length of extracted content words and on the average number of nouns among extracted words. We discuss relations of keywords with higher-order textual features and reveal a connection between keywords and chapter divisions.
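
The core idea (compare a word's gap statistics in the original text with the same statistics after a random permutation) can be illustrated with a minimal sketch. Several choices here are assumptions rather than the authors' exact definitions, which this summary does not reproduce: a gap is taken to be the number of words between two successive occurrences of the same word, $C_1$ and $C_2$ are taken to be the mean and the mean square of those gaps, and the permuted value is averaged over a few shuffles. The function names `gap_moments` and `permutation_ratio` are hypothetical.

```python
import random
from collections import defaultdict


def gap_moments(tokens):
    """Return {word: (C1, C2)} for words occurring at least twice.

    Assumption: a gap is the number of words lying strictly between two
    successive occurrences of the same word; C1 is the mean gap and C2
    is the mean squared gap.
    """
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    moments = {}
    for word, pos in positions.items():
        if len(pos) < 2:
            continue  # a single occurrence defines no gap
        gaps = [b - a - 1 for a, b in zip(pos, pos[1:])]
        n = len(gaps)
        moments[word] = (sum(gaps) / n, sum(g * g for g in gaps) / n)
    return moments


def permutation_ratio(tokens, n_perm=20, seed=0):
    """Return {word: C2 / C2_perm}, with C2_perm averaged over n_perm shuffles."""
    rng = random.Random(seed)
    original = gap_moments(tokens)

    totals = defaultdict(float)
    shuffled = list(tokens)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        for word, (_, c2) in gap_moments(shuffled).items():
            totals[word] += c2

    ratios = {}
    for word, (_, c2) in original.items():
        c2_perm = totals[word] / n_perm
        if c2_perm > 0:  # skip degenerate cases to avoid division by zero
            ratios[word] = c2 / c2_perm
    return ratios
```

A ratio far from 1 marks a word whose spatial distribution differs strongly from that of a shuffled text. How such words are then sorted into local and global keywords, and which thresholds are used, follows the paper's own criteria and is not attempted in this sketch.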

Paper Structure (24 sections, 20 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: For Anna Karenina by L. Tolstoy, we show the space frequency $\tau[w]=1/C_1[w]$ and $1/C_2[w]$ versus word rank for all distinct words $w$ of the text; cf. Eqs. (\ref{durnovo}, \ref{3}). We also show two additional quantities: $1/C_{2\,{\rm perm}}[w]$, i.e. the value of $1/C_2[w]$ after a random permutation of words in the text, and $f[w]/(1-f[w])$, where $f[w]$ is the frequency of $w$; see Eqs. (\ref{dag}, \ref{ordinary}). Distinct words are ranked by $f[w]$, i.e. the most frequent word has rank 1, etc. It is seen that $C_{2\,{\rm perm}}[w]<C_{2}[w]$ holds for frequent words. Both $C_{2\,{\rm perm}}[w]<C_{2}[w]$ and $C_{2\,{\rm perm}}[w]>C_{2}[w]$ occur for less frequent words. Not shown in the figure: a random permutation of the words in the text leaves $\tau[w]$ unaltered for frequent words, while $\tau[w]$ generically increases for less frequent words (clusterization); cf. Eq. (\ref{dag}).
  • Figure 2: For Animal Farm (AF) by G. Orwell, we show the same quantities as for Anna Karenina (AK) in Fig. 1, with the same notation. AK is 11.6 times longer than AF; see Table \ref{tab_gogo}. Some differences between these texts are as follows. The inequality $C_{2\,{\rm perm}}[w]<C_2[w]$ holds for fewer frequent words in AF than in AK. The domains $C_{2\,{\rm perm}}[w]<C_2[w]$ and $C_{2\,{\rm perm}}[w]>C_2[w]$ are well separated in AK, but less so in AF. For AF, relation (\ref{dag}) can be violated for some infrequent words.
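
One way to see why $f[w]/(1-f[w])$ is a natural reference curve in both figures is the following short calculation. It rests on an assumption that is not taken from the paper's equations (which are not reproduced in this summary): in a fully random text, word $w$ occurs independently at each position with probability $f$, and the gap $s$ counts the words between two successive occurrences of $w$.

```latex
% Assumed model: independent occurrences with probability f at each position,
% so the gap s between successive occurrences is geometrically distributed.
P(s) = f\,(1-f)^{s}, \qquad s = 0, 1, 2, \dots

% First moment of the gap and the resulting space frequency:
\langle s \rangle = \sum_{s \ge 0} s\, f\,(1-f)^{s} = \frac{1-f}{f},
\qquad
\frac{1}{\langle s \rangle} = \frac{f}{1-f}.

% Second moment of the gap under the same model:
\langle s^{2} \rangle = \frac{(1-f)(2-f)}{f^{2}}.
```

Under this assumed model the inverse mean gap coincides with $f/(1-f)$, so deviations of $\tau[w]$ from that curve indicate that the occurrences of $w$ are not spread as in a random text.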