Computational lexical analysis of Flamenco genres

Pablo Rosillo-Rodes; Maxi San Miguel; David Sanchez

TL;DR

This study addresses the lack of quantitative analysis of Flamenco lyrics by applying NLP and machine learning to a large lyric corpus and demonstrating that eight main palos can be classified using only lexical content. It uses TF-IDF features with a Multinomial Naive Bayes classifier to identify characteristic words and to quantify inter-palo distances via cosine similarity, further visualized through a minimum spanning tree to reveal lexical clusters. The results yield high palo-discrimination accuracy, uncover essential lexical fields, and produce a lexical-distance network that aligns with established historical kinships among palos. The work provides a quantitative framework for analyzing intangible cultural heritage lyrics, offering new insights into the origin and development of Flamenco styles and guiding future data collection and methodological refinements.
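The distance and network step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the toy lyrics are invented, each palo is treated as a single aggregated document, the smoothed IDF formula and the use of cosine distance (one minus cosine similarity) are assumptions, and the function names `tfidf_vectors`, `cosine_distance`, and `minimum_spanning_tree` are hypothetical.

```python
import math
from collections import Counter

# Toy, invented lyrics grouped by palo (hypothetical data, not from the corpus).
corpus = {
    "solea": "pena pena madre muerte llanto",
    "seguiriya": "muerte pena llanto campana madre",
    "fandango": "caballo campo huelva rio",
    "buleria": "fiesta baile caballo rio",
}

def tfidf_vectors(docs):
    """Return one TF-IDF dict per document (here, each palo is one document)."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n_docs = len(docs)
    doc_freq = Counter(word for c in counts.values() for word in c)
    vectors = {}
    for name, c in counts.items():
        total = sum(c.values())
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors[name] = {
            w: (n / total) * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
            for w, n in c.items()
        }
    return vectors

def cosine_distance(u, v):
    """One minus the cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

def minimum_spanning_tree(names, dist):
    """Prim's algorithm; returns a list of (node_a, node_b, distance) edges."""
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge that connects the tree to a new node.
        a, b = min(
            ((a, b) for a in in_tree for b in names if b not in in_tree),
            key=lambda pair: dist[pair],
        )
        edges.append((a, b, dist[(a, b)]))
        in_tree.add(b)
    return edges

vectors = tfidf_vectors(corpus)
names = sorted(vectors)
dist = {(a, b): cosine_distance(vectors[a], vectors[b]) for a in names for b in names}
mst = minimum_spanning_tree(names, dist)
for a, b, d in mst:
    print(f"{a} -- {b}: {d:.3f}")
```

In this toy example, palos that share vocabulary end up adjacent in the tree, while palos with no words in common sit at the maximum distance of 1.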

Abstract

Flamenco, recognized by UNESCO as part of the Intangible Cultural Heritage of Humanity, is a profound expression of cultural identity rooted in Andalusia, Spain. However, there is a lack of quantitative studies that help identify characteristic patterns in this long-lived music tradition. In this work, we present a computational analysis of Flamenco lyrics, employing natural language processing and machine learning to categorize over 2000 lyrics into their respective Flamenco genres, termed $\textit{palos}$. Using a Multinomial Naive Bayes classifier, we find that lexical variation across styles makes it possible to identify distinct $\textit{palos}$ accurately. More importantly, using an automatic method based on word usage, we obtain the semantic fields that characterize each style. Further, by applying a metric that quantifies the inter-genre distance, we perform a network analysis that sheds light on the relationship between Flamenco styles. Remarkably, our results suggest historical connections and $\textit{palo}$ evolutions. Overall, our work illuminates the intricate relationships and cultural significance embedded within Flamenco lyrics, complementing previous qualitative discussions with quantitative analyses and sparking new discussions on the origin and development of traditional music genres.
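To make the classification idea concrete, the following is a minimal sketch of a Multinomial Naive Bayes classifier written from scratch. It is an illustration under stated assumptions rather than the pipeline used in the paper: the training lyrics are invented, raw word counts stand in for the TF-IDF features mentioned in the TL;DR, Laplace smoothing with `alpha=1.0` is assumed, and the helper names `fit` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy, invented training lyrics (hypothetical; not taken from the corpus).
train = [
    ("solea", "pena madre llanto"),
    ("solea", "muerte pena madre"),
    ("fandango", "caballo campo rio"),
    ("fandango", "rio huelva caballo"),
]

def fit(samples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods."""
    class_docs = Counter(label for label, _ in samples)
    word_counts = defaultdict(Counter)
    for label, text in samples:
        word_counts[label].update(text.split())
    vocab = {w for c in word_counts.values() for w in c}
    log_prior = {k: math.log(n / len(samples)) for k, n in class_docs.items()}
    log_like = {}
    for k, counts in word_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        # Counter returns 0 for words never seen in class k.
        log_like[k] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_like, vocab

def predict(text, model):
    """Return the class with the highest posterior log score."""
    log_prior, log_like, vocab = model
    tokens = [w for w in text.split() if w in vocab]  # ignore unseen words
    scores = {
        k: log_prior[k] + sum(log_like[k][w] for w in tokens)
        for k in log_prior
    }
    return max(scores, key=scores.get)

model = fit(train)
print(predict("pena y llanto de mi madre", model))
print(predict("mi caballo cruza el rio", model))
```

The per-class log likelihoods in `log_like` also hint at how characteristic words can be read off such a model: words whose likelihood is much higher in one class than in the others are the ones that drive its predictions.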

Paper Structure (30 sections, 5 equations, 29 figures, 3 tables)

Introduction
Results
Dataset
Vocabulary size distribution.
Standardized type-to-token ratio.
Palo classification
Characteristic lexicon extraction
Relationship between palos
Discussion
Conclusions
Corpus characterization methods
Empirical laws
Text preprocessing
Additional methods
Multinomial Naive Bayes
...and 15 more sections

Figures (29)

Figure 1: Distribution of the number of lyrics for the 20 most represented genres or palos in the corpus. The red rectangle marks the 8 palos with the highest representation. A further 58 palos, each with fewer than 30 lyrics, are not shown because they are heavily under-represented.
Figure 2: Number of tokens $L$ (bars on the right) and types $|V|$ (bars on the left) for each style.
Figure 3: Distribution of the vocabulary size $|V|$ (number of types) for each style. The continuous distribution is derived from a discrete histogram using kernel density estimation (seaborn kdeplot).
Figure 4: Standardized type-to-token ratio $(sTTR)$ for each genre. The solid horizontal line shows the mean $sTTR$ for the entire corpus (our null model), and the two dashed lines show the error of the mean.
Figure 5: Confusion matrix of the MNB for an arbitrary training run, showing the percentage of each palo or genre correctly predicted or confused.
...and 24 more figures
