
BanglaAutoKG: Automatic Bangla Knowledge Graph Construction with Semantic Neural Graph Filtering

Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Dong-Kyu Chae

TL;DR

BanglaAutoKG addresses the underrepresentation of Bengali in knowledge graphs by introducing an automated framework that constructs semantically enriched Bengali KGs from any Bangla text. It integrates multilingual LLMs for entity and relation extraction, a translation dictionary for cross-language mapping, and BERT-based node features, followed by a two-stage denoising and semantic-filtering pipeline built on graph neural networks and attention-based convolutions. The approach yields a base KG that is progressively refined into a final, semantically coherent graph, demonstrated through case studies on poems and Wikipedia articles and supported by ablation analyses. This work enables scalable KG generation from arbitrary Bangla text, with implications for information retrieval, fact-checking, and knowledge discovery in Bengali-language contexts.

Abstract

Knowledge Graphs (KGs) have proven essential in information processing and reasoning applications because they link related entities and give context-rich information, supporting efficient information retrieval and knowledge discovery and presenting information flow effectively. Although Bangla is widely spoken globally, it is relatively underrepresented in KGs due to a lack of comprehensive datasets, encoders, NER (named entity recognition) models, POS (part-of-speech) taggers, and lemmatizers, which hinders efficient information processing and reasoning applications in the language. Addressing the KG scarcity in Bengali, we propose BanglaAutoKG, a pioneering framework that automatically constructs Bengali KGs from any Bangla text. We utilize multilingual LLMs to understand various languages and correlate entities and relations universally. By employing a translation dictionary to identify English equivalents and extracting word features from pre-trained BERT models, we construct the foundational KG. To reduce noise and align word embeddings with our goal, we employ graph-based polynomial filters. Lastly, we implement a GNN-based semantic filter, which elevates contextual understanding and trims unnecessary edges, culminating in the formation of the definitive KG. Empirical findings and case studies demonstrate the universal effectiveness of our model, capable of autonomously constructing semantically enriched KGs from any text.
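
The two refinement stages named in the abstract (polynomial graph filtering of node features, then trimming of unnecessary edges) can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives neither the filter coefficients nor the GNN architecture, so the example below assumes a fixed low-pass polynomial in the normalized adjacency matrix and substitutes a plain cosine-similarity threshold for the learned semantic filter. All function names, the toy features, and the threshold are hypothetical.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph, as nested lists."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = 1.0  # self-loop
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]

def matmul(a, x):
    """Multiply matrix a (n x n) by feature matrix x (n x d)."""
    return [[sum(a[i][k] * x[k][j] for k in range(len(x)))
             for j in range(len(x[0]))] for i in range(len(a))]

def polynomial_filter(a_norm, x, coeffs):
    """Compute sum_k coeffs[k] * a_norm^k @ x (assumed, fixed coefficients)."""
    out = [[0.0] * len(x[0]) for _ in x]
    power = x  # a_norm^0 @ x
    for c in coeffs:
        for i, row in enumerate(power):
            for j, value in enumerate(row):
                out[i][j] += c * value
        power = matmul(a_norm, power)
    return out

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def prune_edges(edges, x, threshold):
    """Keep only edges whose endpoint features are similar enough.

    Stand-in for the learned GNN-based semantic filter described in the text.
    """
    return [(u, v) for u, v in edges if cosine(x[u], x[v]) >= threshold]

# Toy base KG: nodes 0-1 and 2-3 form two topics, joined by a noisy edge (1, 2).
# In the described pipeline the features would come from BERT; these are made up.
edges = [(0, 1), (1, 2), (2, 3)]
features = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

a_norm = normalized_adjacency(len(features), edges)
smoothed = polynomial_filter(a_norm, features, coeffs=[0.5, 0.5])
kept = prune_edges(edges, smoothed, threshold=0.8)
print(kept)
```

On this toy graph the cross-topic edge (1, 2) falls below the threshold and is dropped, while the two within-topic edges are kept. A learned filter would replace both the fixed coefficients and the hard threshold.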

Paper Structure

This paper contains 11 sections, 5 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: The overall framework of BanglaAutoKG. Text data is passed through a multilingual LLM to obtain entities and entity types, which are used to build a base KG with dictionary-based BERT embeddings. This graph is then semantically filtered using local neighborhood and topological relations to extract important nodes and edges, resulting in the final KG.
  • Figure 2: Case Study: KG of a Poem.
  • Figure 3: Case Study: KG of a Wikipedia section.