D4R -- Exploring and Querying Relational Graphs Using Natural Language and Large Language Models -- the Case of Historical Documents

Michel Boeglin, David Kahn, Josiane Mothe, Diego Ortiz, David Panzoli

TL;DR

D4R addresses the challenge of enabling non-technical historians to interrogate large textual corpora via graph representations and natural-language interfaces. It combines graph-based data modeling in Neo4j with LLM-driven translation of natural language into Cypher and an intuitive visualization interface. The paper demonstrates cross-domain adaptability (e.g., on CoNLL04), and the platform offers an expert mode for direct Cypher editing, suggesting practical utility beyond historical documents. The approach supports incremental knowledge discovery while anchoring provenance to source texts, positioning AI as an augmentation of, rather than a replacement for, the scholarly workflow.

Abstract

D4R is a digital platform designed to assist non-technical users, particularly historians, in exploring textual documents through advanced graphical tools for text analysis and knowledge extraction. By leveraging a large language model, D4R translates natural language questions into Cypher queries that retrieve data from a Neo4j database. A user-friendly graphical interface lets users navigate and analyse complex relational data extracted from unstructured textual documents. Originally designed to bridge the gap between AI technologies and historical research, D4R's capabilities extend to various other domains. A demonstration video and a live software demo are available.

Paper Structure

This paper contains 8 sections, 7 figures.

Figures (7)

  • Figure 1: Left: Distribution of node types in the historical dataset by category. Right: Distribution of relationship types in the historical dataset.
  • Figure 2: D4R presents any text corpus as a graph whose entities and relationships are automatically extracted. Users can query this graph by formulating natural language questions, which are then automatically translated into Cypher queries to retrieve information from the Neo4j database.
  • Figure 3: (a) A user query, expressed in natural language, is translated into Cypher syntax for execution in the Neo4j database. The result is a subgraph, displayed in the main window, along with a natural language response. (b) D4R's expert mode enables users to edit Cypher queries directly, provided they have the necessary expertise.
  • Figure 4: The first query aims to identify entities in the corpus that have both the “person” and “religious” attributes.
  • Figure 5: The graph generated by the second query identifies individuals who interacted with Fray Bartolomé de Miranda and quantifies their level of communication. Note: the distinction of data labels in this illustration is not essential for understanding.
  • ...and 2 more figures
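
The translation step described in Figures 2 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not D4R's implementation: the schema string, the function names `build_prompt` and `is_read_only`, and the example query are all hypothetical, and the example assumes that the "person" and "religious" attributes of Figure 4 are stored as node labels. No LLM or database is called; the sketch only shows how a prompt might be assembled and how a generated query could be screened before execution.

```python
import re

# Hypothetical schema summary; the real D4R labels and relationship types may differ.
SCHEMA = "Node labels: Person, Religious, Place. Relationship types: WROTE_TO, MENTIONS."


def build_prompt(question: str, schema: str = SCHEMA) -> str:
    """Assemble the text sent to an LLM that is asked for a single Cypher query."""
    return (
        "You translate questions into Cypher for a Neo4j database.\n"
        f"Schema: {schema}\n"
        f"Question: {question}\n"
        "Answer with one read-only Cypher query and nothing else."
    )


# Cypher keywords that modify the graph (a naive keyword screen, not a parser).
_WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def is_read_only(cypher: str) -> bool:
    """Return True when the query contains none of the write keywords above."""
    return _WRITE_CLAUSES.search(cypher) is None


# Illustrative query in the spirit of Figure 4, assuming both attributes are node labels.
EXAMPLE_QUERY = "MATCH (n:Person:Religious) RETURN n LIMIT 25"

if __name__ == "__main__":
    print(build_prompt("Which persons are also religious figures?"))
    print(is_read_only(EXAMPLE_QUERY))
```

The keyword screen is deliberately simple: it can reject a harmless query that mentions one of these words inside a string literal, and it does not catch writes performed through procedure calls, so a production system would more likely rely on a read-only database role.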