Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas

Venkatesh Bollineni, Igor Crk, Eren Gultepe

TL;DR

The paper tackles the challenge of organizing the Rigveda's vast, archaic text by building a network of suktas using three embedding strategies and a six-step NLP pipeline. A novel mean-LSA embedding is introduced, combining word-level LSA vectors into per-sukta representations, and the networks' significance is tested against a null permutation distribution. Results show that mean-LSA produces a statistically significant topic structure (modularity $Q=0.944$, $p<0.01$) that aligns with all seven traditional sukta groupings, while SBERT and Doc2Vec fail to achieve significance. The study provides a data-driven framework for navigating the Rigveda and highlights the importance of statistical validation for topic networks in ancient texts, with implications for future Sanskrit NLP and sacred-text analytics.

Abstract

Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. By using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using (i) a novel adaptation of LSA, presented herein, (ii) SBERT, and (iii) Doc2Vec. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Community detection of topics in the sukta networks was then performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the resulting topics was determined using an appropriate null distribution. Only the novel adaptation of LSA, combined with the Leiden method, detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven famous sukta groupings analyzed (e.g., creation, funeral, water), the LSA-derived network was successful in all seven cases, while the Doc2Vec network was not significant and failed to detect the relevant suktas. SBERT detected four of the famous sukta groupings as separate groups but mistakenly combined three of them into a single mixed group; the SBERT network was also not statistically significant.
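To make the idea of averaging word-level LSA vectors into one vector per sukta concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function name `mean_lsa_embeddings`, the whitespace tokenizer, the raw-count weighting, the rank `k=2`, and the four toy "documents" (invented word lists, not Rigveda text) are all assumptions made for illustration.

```python
import numpy as np

def mean_lsa_embeddings(docs, k=2):
    """Average word-level LSA vectors into one k-dimensional vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}

    # Term-document count matrix (rows: words, columns: documents).
    # Assumption: raw counts; the weighting used in the paper is not stated here.
    counts = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for w in toks:
            counts[index[w], j] += 1.0

    # Truncated SVD: the rows of U_k * S_k serve as word-level LSA vectors.
    U, S, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(S))
    word_vecs = U[:, :k] * S[:k]

    # Document embedding = mean of the vectors of its tokens (repeats count).
    return np.vstack([word_vecs[[index[w] for w in toks]].mean(axis=0)
                      for toks in tokenized])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus: two "fire" documents and two "water" documents.
docs = [
    "fire priest offering fire",
    "waters flow rivers waters waters",
    "fire offering priest",
    "rivers waters flow",
]
emb = mean_lsa_embeddings(docs, k=2)
same_topic = cosine(emb[0], emb[2])
cross_topic = cosine(emb[0], emb[1])
```

In this toy corpus the two topics share no vocabulary, so the same-topic pair ends up far more similar than the cross-topic pair. Real sukta translations overlap heavily in vocabulary, which is why the rank and the preprocessing matter much more in practice.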

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Processing pipeline for obtaining the network of suktas and topics using the three types of embedding techniques (mean-LSA, SBERT, Doc2Vec). Steps (1) and (2) created the embeddings used to form the sukta networks. In steps (3) and (4), using the 4-nearest neighbours of each sukta, the networks of topics were detected using community detection methods. Finally, in steps (5) and (6), the statistical significance of the detected network structures was determined and the grouped suktas were analyzed.
  • Figure 2: UMAP visualization of the Rigveda sukta network derived from mean-LSA embeddings. Top: The full network representation shows 43 unique clusters with a modularity of 0.944 and a statistically significant structure (z = 2.726, p < .01). Bottom: The highlighted clusters represent a subset of seven famous sukta topics: Creation, Marut, Water, Surya, Brihaspati, Heaven & Earth, and Funeral. The mean-LSA embedding network was successful in identifying clusters that contained the semantically related suktas in all seven cases.
  • Figure 3: UMAP visualization of the Rigveda sukta network derived from SBERT embeddings. Top: The full network representation shows 47 distinct clusters with a modularity of 0.950. Although SBERT's modularity is slightly higher than that of mean-LSA (0.944), the network failed the significance test (z = -0.876, p = .810). Bottom: SBERT failed to separate three different sukta topics (Creation, Funeral, Heaven & Earth) and merged them into a single cluster (Mixed).
  • Figure 4: UMAP visualization of the Rigveda sukta network derived from Doc2Vec embeddings. Top: The full network depicts 55 individual clusters with a modularity of 0.952, the highest among the three sukta embedding methods. Despite the higher modularity, the network did not pass the statistical significance test (z = -0.126, p = .550). Bottom: For three of the seven famous cases, Doc2Vec failed to group the semantically related suktas into relevant clusters, and for the four remaining cases (Marut, Surya, Brihaspati, Funeral) the suktas were irregularly distributed.
  • Figure 5: Comparison of the Creation sukta clusters for the mean-LSA and SBERT sukta embeddings. Top: The mean-LSA network gathers all nine well-known Creation suktas (the relevant suktas) into a single cluster together with 22 other, non-famous suktas. Bottom: SBERT grouped eight of the nine popular Creation suktas together. However, this cluster also contains suktas from two other topics (Funeral and Heaven & Earth), indicating that SBERT failed to distinguish suktas belonging to other topics.
  • ...and 2 more figures
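
The captions above report a modularity and a z-score for each network, and the contrast between them (high modularity, yet no significance for SBERT and Doc2Vec) is easier to follow with a small worked example. The sketch below uses only the Python standard library and rests on several assumptions: the helper names are hypothetical, the points are invented 2-D coordinates standing in for UMAP output, the community labels are supplied by hand rather than found by Leiden, and the null distribution is a simple label shuffle, which may differ from the null used in the paper.

```python
import math
import random
from statistics import mean, pstdev

def knn_edges(points, k=4):
    """Undirected edge set linking every point to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def modularity(edges, labels):
    """Newman modularity Q = sum_c [L_c / m - (d_c / 2m)^2], unweighted graph."""
    m = len(edges)
    internal, degree = {}, {}
    for i, j in edges:
        # Each edge adds one unit of degree to the community of each endpoint.
        degree[labels[i]] = degree.get(labels[i], 0) + 1
        degree[labels[j]] = degree.get(labels[j], 0) + 1
        if labels[i] == labels[j]:
            internal[labels[i]] = internal.get(labels[i], 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

def permutation_z(edges, labels, n_perm=200, seed=0):
    """z-score of observed modularity against a label-shuffle null (assumed null)."""
    rng = random.Random(seed)
    observed = modularity(edges, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(modularity(edges, shuffled))
    spread = pstdev(null)
    z = (observed - mean(null)) / spread if spread > 0 else float("nan")
    return observed, z

# Hypothetical 2-D coordinates: two well-separated groups of five points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = [0] * 5 + [1] * 5           # hand-assigned communities, not Leiden output
edges = knn_edges(points, k=4)       # 4 nearest neighbours, as in Figure 1
q, z = permutation_z(edges, labels)  # q is 0.5 for these two disjoint 5-cliques
```

The design point is that modularity on its own is not compared with anything: only the z-score relates the observed value to what the same graph yields under the null, which is how a network with a higher raw modularity can still fail the significance test.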