Semantic Communities and Boundary-Spanning Lyrics in K-pop: A Graph-Based Unsupervised Analysis

Oktay Karakuş

TL;DR

This study tackles the challenge of uncovering latent semantic structure in large-scale, multilingual lyric corpora without supervision. It introduces a line-level, graph-based framework that embeds lyric lines with multilingual sentence representations, builds a sparse lyric similarity graph, and applies modularity-based community detection to reveal 18 stable semantic communities. Boundary-spanning songs are identified via betweenness centrality and neighbor-community diversity, and these bridges exhibit higher lexical diversity and lower repetition than core members; out-of-sample hits further validate the framework by locating contemporary songs at semantic interfaces. Across 7,983 K-pop songs, the approach demonstrates that semantics organize beyond artist labels and language, offering a robust, language-agnostic tool for analyzing cultural text and enabling cross-temporal and cross-artist comparisons. The methodology provides a scalable, interpretable means to study semantic hybridity and cross-theme accessibility in music lyrics, with potential applicability to other unlabeled cultural corpora.
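The graph-construction and bridge-scoring steps described above can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration: the two-dimensional vectors stand in for multilingual sentence embeddings, the hard-coded `community` labels stand in for the output of a modularity-based detector, and neighbor-community diversity is taken to be the Shannon entropy of neighbor labels, which may differ from the paper's exact definition. Betweenness centrality, the paper's other bridge signal, is not shown.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_graph(vectors, k):
    """Sparse undirected graph: link each node to its k most similar nodes."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        sims = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrise the edge
    return adj


def neighbor_community_diversity(adj, community):
    """Entropy (bits) of community labels among each node's neighbours.

    Assumed definition: 0 means all neighbours share one community,
    higher values mean the node touches several communities.
    """
    scores = {}
    for node, nbrs in adj.items():
        counts = Counter(community[j] for j in nbrs)
        total = sum(counts.values())
        scores[node] = (
            -sum((c / total) * math.log2(c / total) for c in counts.values())
            if total
            else 0.0
        )
    return scores


# Toy stand-ins for song embeddings (hypothetical data).
vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],  # theme A
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],  # theme B
    [0.7, 0.7],                            # mixes both themes
]
# Hard-coded labels standing in for detected communities (assumption).
community = [0, 0, 0, 1, 1, 1, 0]

adj = knn_graph(vectors, k=2)
diversity = neighbor_community_diversity(adj, community)
bridge = max(diversity, key=diversity.get)  # node with the most mixed neighbourhood
```

In this toy graph the mixed-theme node is the only one whose neighbors are split evenly between the two communities, so it receives the highest diversity score. On a real corpus the labels would come from the community-detection step rather than being fixed by hand.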

Abstract

Large-scale lyric corpora present unique challenges for data-driven analysis, including the absence of reliable annotations, multilingual content, and high levels of stylistic repetition. Most existing approaches rely on supervised classification, genre labels, or coarse document-level representations, limiting their ability to uncover latent semantic structure. We present a graph-based framework for unsupervised discovery and evaluation of semantic communities in K-pop lyrics using line-level semantic representations. By constructing a similarity graph over lyric texts and applying community detection, we uncover stable micro-theme communities without genre, artist, or language supervision. We further identify boundary-spanning songs via graph-theoretic bridge metrics and analyse their structural properties. Across multiple robustness settings, boundary-spanning lyrics exhibit higher lexical diversity and lower repetition compared to core community members, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. Our framework is language-agnostic and applicable to unlabeled cultural text corpora.
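The contrast between lexical diversity and repetition can be made concrete with a small sketch. The definitions here are assumptions, since this summary does not give the paper's formulas: lexical entropy is computed as the Shannon entropy of the whitespace-token distribution, and the line repeat ratio as the share of lines that duplicate another line. The function names and the two example lyrics are hypothetical.

```python
import math
from collections import Counter


def lexical_entropy(lines):
    """Shannon entropy (bits) of the token distribution over all lines."""
    tokens = [tok for line in lines for tok in line.lower().split()]
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def line_repeat_ratio(lines):
    """Fraction of non-empty lines that repeat an earlier line (0 = no repeats)."""
    cleaned = [line.strip().lower() for line in lines if line.strip()]
    if not cleaned:
        return 0.0
    return 1.0 - len(set(cleaned)) / len(cleaned)


# Hypothetical lyrics: one hook-heavy, one lexically varied.
hooky = ["la la la", "la la la", "la la la", "hold me now"]
varied = ["city lights fade", "rain on the window", "we run till dawn", "hold me now"]

hooky_profile = (lexical_entropy(hooky), line_repeat_ratio(hooky))
varied_profile = (lexical_entropy(varied), line_repeat_ratio(varied))
```

Under these assumed definitions, the hook-heavy example has lower entropy and a higher repeat ratio than the varied one, which is the direction of difference the abstract reports between core and boundary-spanning lyrics.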

Paper Structure

This paper contains 36 sections, 4 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Pipeline overview.
  • Figure 2: Distributions of lyrical and structural metrics across the K-pop lyric corpus. Shown are the distributions of (a) lexical entropy, (b) line repeat ratio, (c) chorus score, and (d) boundary score. The broad and skewed distributions indicate substantial heterogeneity in lyrical structure and semantic positioning across songs.
  • Figure 3: Comparison of lyrical properties between boundary-spanning and non-bridge songs. Boxplots show lexical entropy, line repeat ratio, and chorus score for the two groups. Boundary-spanning lyrics exhibit higher lexical entropy and modestly reduced repetition, indicating greater linguistic diversity.
  • Figure 4: UMAP projection of K-pop song lyrics coloured by Louvain semantic communities. Each point represents a song, and colours indicate unsupervised communities inferred from the lyrical similarity graph.
  • Figure 5: Artist-level distributions within the global semantic space. Songs by BTS and BLACKPINK are highlighted against the full lyric corpus, illustrating their dispersion across multiple semantic communities.
  • ...and 1 more figure