Automated Construction of Theme-specific Knowledge Graphs

Linyi Ding, Sizhe Zhou, Jinfeng Xiao, Jiawei Han

TL;DR

This work addresses two core KG challenges, coarse information granularity and lack of timeliness, by introducing ThemeKG, a theme-specific knowledge graph built from a theme corpus through an unsupervised framework (TKGCon). The approach constructs an entity ontology for the theme from Wikipedia and an LLM-generated relation ontology, then performs entity recognition and typing followed by context-aware relation extraction to yield coherent, theme-aligned triples. Empirical results on two themes, EV battery and the Hamas attack on Israel, show better entity, triple, and theme-coherence metrics than strong baselines, with ablations confirming the value of ontology guidance. The resulting ThemeKGs enable timely, fine-grained knowledge augmentation for LLMs and downstream tasks such as QA and RAG, supporting theme-specific reasoning and information retrieval.

Abstract

Despite widespread applications of knowledge graphs (KGs) in various tasks such as question answering and intelligent conversational systems, existing KGs face two major challenges: information granularity and deficiency in timeliness. These considerably hinder the retrieval and analysis of in-context, fine-grained, and up-to-date knowledge from KGs, particularly in highly specialized themes (e.g., specialized scientific research) and rapidly evolving contexts (e.g., breaking news or disaster tracking). To tackle such challenges, we propose a theme-specific knowledge graph (i.e., ThemeKG), a KG constructed from a theme-specific corpus, and design an unsupervised framework for ThemeKG construction (named TKGCon). The framework takes a raw theme-specific corpus and generates a high-quality KG that includes salient entities and relations under the theme. Specifically, we start with an entity ontology of the theme from Wikipedia, based on which we then generate candidate relations with Large Language Models (LLMs) to construct a relation ontology. To parse the documents from the theme corpus, we first map the extracted entity pairs to the ontology and retrieve the candidate relations. Finally, we incorporate the context and ontology to consolidate the relations for entity pairs. We observe that directly prompting GPT-4 for a theme-specific KG leads to inaccurate entities (such as "two main types" as one entity in the query result) and unclear (such as "is", "has") or wrong relations (such as "have due to", "to start"). In contrast, by constructing the theme-specific KG step by step, our model outperforms GPT-4 and consistently identifies accurate entities and relations. Experimental results also show that our framework excels in evaluations compared with various KG construction baselines.
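
The step in which extracted entity pairs are mapped to the ontology and candidate relations are retrieved can be pictured with a minimal sketch. Everything in it is a hypothetical stand-in: the entity types, the relation ontology entries, and the function name `candidate_relations` are invented for illustration and are not taken from TKGCon's actual implementation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical entity ontology: surface phrase -> entity type.
# The paper derives its entity ontology from Wikipedia; these entries are made up.
ENTITY_ONTOLOGY: Dict[str, str] = {
    "lithium-ion battery": "battery",
    "cathode": "battery component",
    "graphite": "material",
}

# Hypothetical relation ontology: (head type, tail type) -> candidate relations.
# The paper generates candidates with an LLM; these entries are made up.
RELATION_ONTOLOGY: Dict[Tuple[str, str], List[str]] = {
    ("battery", "battery component"): ["has component", "uses"],
    ("battery component", "material"): ["made of", "coated with"],
}


def candidate_relations(head: str, tail: str) -> Optional[List[str]]:
    """Map an entity pair to ontology types and look up candidate relations.

    Returns None when either entity is not covered by the entity ontology,
    and an empty list when the type pair has no candidate relations.
    """
    head_type = ENTITY_ONTOLOGY.get(head.lower())
    tail_type = ENTITY_ONTOLOGY.get(tail.lower())
    if head_type is None or tail_type is None:
        return None
    return RELATION_ONTOLOGY.get((head_type, tail_type), [])
```

Keying the relation ontology on type pairs rather than on individual entities is one plausible reading of "map the extracted entity pairs to the ontology": it lets a small set of candidate relations cover every entity pair of the same types.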

Paper Structure (30 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Given a set of theme-specific documents, automatic construction of a theme-specific knowledge graph.
  • Figure 2: The overall framework of TKGCon consists of (i) ontology construction and (ii) ThemeKG construction. For (i), it leverages the large general-purpose Wikipedia and GPT-4's reasoning ability to obtain a high-quality entity ontology and relation ontology for the given theme. For (ii), we first process the theme documents, with phrase mining by SpaCy and entity typing by ZOE, to retrieve candidate relations. Finally, the candidate relations generated by LLMs are further filtered with contextual information to consolidate the final relations.
  • Figure 3: Comparison of ThemeKG and WikiData on EV battery. For the theme EV battery, the left side is part of our ThemeKG extracted from the theme-specific document corpus. The triples on the right side are retrieved from WikiData for the same topics. Our ThemeKG contains more specific entities and relations of the theme compared to the WikiData KG.
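
The final step in the Figure 2 caption, filtering candidate relations with contextual information, can be illustrated with a toy sketch. It swaps in a simple cue-word overlap heuristic where the paper relies on context-aware filtering; the cue-word table and the function name `consolidate` are assumptions made only for this example.

```python
from typing import Dict, List, Set

# Hypothetical cue words for each candidate relation (made up for illustration).
RELATION_CUES: Dict[str, Set[str]] = {
    "made of": {"made", "composed", "consists"},
    "coated with": {"coated", "coating"},
}


def consolidate(sentence: str, candidates: List[str]) -> List[str]:
    """Keep only candidate relations supported by a cue word in the sentence.

    This keyword heuristic is a stand-in for context-aware filtering,
    not the method used in the paper.
    """
    # Lowercase the tokens and strip surrounding punctuation before matching.
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    return [rel for rel in candidates if RELATION_CUES.get(rel, set()) & tokens]
```

Given the sentence "The anode is coated with graphite." and the candidates "made of" and "coated with", only "coated with" survives, because its cue word "coated" appears in the context.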