Bridging Legal Knowledge and AI: Retrieval-Augmented Generation with Vector Stores, Knowledge Graphs, and Hierarchical Non-negative Matrix Factorization

Ryan C. Barron, Maksim E. Eren, Olga M. Serafimova, Cynthia Matuszek, Boian S. Alexandrov

TL;DR

This paper tackles the challenge of retrieving and reasoning over complex legal texts by introducing a domain-specific retrieval-augmented generation framework that combines vector stores, a Neo4j knowledge graph, and hierarchical non-negative matrix factorization (HNMFk). The system ingests public legal corpora (via Justia) and builds a multi-level topic structure that feeds a KG, enabling semantically grounded and explainable QA when paired with LLMs. It provides a New Mexico (NM)-focused dataset and evaluates retrieval and QA performance using metrics such as ROUGE-L, NLI, SummaC, and FactCC, supplemented by human judgments and four concrete case studies. The approach advances computational law by fusing semantic embeddings, latent topic discovery, and graph-based reasoning to improve accuracy, traceability, and scalability in legal information retrieval and reasoning, while outlining future work on broader corpus coverage and deeper precedent analysis.
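Of the metrics named above, ROUGE-L is the simplest to state: it scores a generated answer against a reference by the length of their longest common subsequence (LCS) of tokens. The following is a minimal sketch of that definition, assuming whitespace tokenization and lowercase matching; the function names and the `beta` parameter are illustrative and are not taken from the paper's evaluation code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between a candidate and a reference string.

    Assumption: tokens are lowercase whitespace-separated words.
    beta weights recall relative to precision (beta=1 gives the F1 form).
    """
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    answer = "the court denied the motion to dismiss"
    gold = "the court denied the motion"
    print(round(rouge_l(answer, gold), 3))
```

Because the score depends only on token order and overlap, it rewards answers that preserve the reference wording but says nothing about factual consistency, which is presumably why it is paired with NLI-style metrics here.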

Abstract

Agentic Generative AI, powered by Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Vector Stores (VSs), represents a transformative technology applicable to specialized domains such as legal systems, research, recommender systems, cybersecurity, and global security, including proliferation research. This technology excels at inferring relationships within vast unstructured or semi-structured datasets. The legal domain comprises constitutions, statutes, regulations, and case law, which together form an extensive, interrelated, semi-structured body of knowledge. Extracting insights from these documents and navigating the networks of relations among them is crucial for effective legal research. Here, we introduce a generative AI system that integrates RAG, a VS, and a KG constructed via Non-Negative Matrix Factorization (NMF) to enhance legal information retrieval and AI reasoning and to minimize hallucinations. In the legal system, these technologies empower AI agents to identify and analyze complex connections among cases, statutes, and legal precedents, uncovering hidden relationships and predicting legal trends. These are challenging tasks that are essential for ensuring justice and improving operational efficiency. Our system employs web scraping to systematically collect legal texts, such as statutes, constitutional provisions, and case law, from publicly accessible platforms like Justia. It bridges the gap between traditional keyword-based search and contextual understanding by leveraging semantic representations, hierarchical relationships, and latent topic discovery. The framework supports legal document clustering, summarization, and cross-referencing, enabling scalable, interpretable, and accurate retrieval over semi-structured data while advancing computational law and AI.
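To make the idea of hierarchical latent topic discovery concrete, the sketch below factors a small nonnegative term-by-document matrix with plain NMF (Lee-Seung multiplicative updates), assigns each document to its dominant topic, and then re-factors each resulting cluster. It is an illustration only and not the paper's HNMFk implementation: the number of topics `k` is fixed by hand rather than estimated, and the names `nmf`, `hierarchical_nmf`, `max_depth`, and `min_docs`, as well as the toy matrix, are assumptions made for this example.

```python
import random


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Approximate nonnegative X (terms x docs) as W (terms x k) times H (k x docs)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, H


def hierarchical_nmf(X, doc_ids, k=2, max_depth=2, min_docs=4, depth=0):
    """Recursively split documents by dominant topic; returns a nested dict tree."""
    node = {"depth": depth, "docs": list(doc_ids), "children": []}
    if depth >= max_depth or len(doc_ids) < min_docs:
        return node
    _, H = nmf(X, k)
    # Each document (column of H) goes to the topic with the largest weight.
    labels = [max(range(k), key=lambda t: H[t][j]) for j in range(len(doc_ids))]
    for t in range(k):
        cols = [j for j, lab in enumerate(labels) if lab == t]
        if not cols or len(cols) == len(doc_ids):
            continue  # no real split for this topic; leave the node as a leaf
        sub = [[row[j] for j in cols] for row in X]
        ids = [doc_ids[j] for j in cols]
        node["children"].append(hierarchical_nmf(sub, ids, k, max_depth, min_docs, depth + 1))
    return node


def leaves(node):
    if not node["children"]:
        return [node["docs"]]
    return [docs for child in node["children"] for docs in leaves(child)]


if __name__ == "__main__":
    # Toy term-by-document counts: terms 0-2 dominate docs 0-3, terms 3-5 dominate docs 4-7.
    X = [
        [3, 2, 4, 3, 0, 0, 0, 0],
        [2, 3, 1, 2, 0, 0, 0, 0],
        [1, 1, 2, 4, 0, 0, 0, 0],
        [0, 0, 0, 0, 4, 3, 2, 1],
        [0, 0, 0, 0, 1, 2, 3, 4],
        [0, 0, 0, 0, 2, 2, 1, 3],
    ]
    tree = hierarchical_nmf(X, list(range(8)))
    print(leaves(tree))
```

In a pipeline like the one described, each node of such a tree would carry a topic label and could become a node in the knowledge graph, with edges to the documents it contains; how the paper actually maps clusters to graph entities is not specified in this section.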

Paper Structure

This paper contains 26 sections, 8 figures, 5 tables.

Figures (8)

  • Figure 1: System Overview
  • Figure 2: Knowledge Graph Schema, with the primary identifier in bold and attributes in brackets.
  • Figure 3: New Mexico Supreme/Appeals case counts per year.
  • Figure 4: Legal Documents from New Mexico hierarchically decomposed. The Constitution only had enough documents to decompose the first depths, whereas the other three sources continued to the terminal depth of 2 (a hyper-parameter of decomposition). Each H-cluster has a natural language label, where depth-0 from each can be seen in Tables \ref{tab:Constitution}, \ref{tab:statutes}, \ref{tab:supreme}, and \ref{tab:appeals}.
  • Figure 5: Examination of 'Estoppel' as a topic keyword versus as a term in the bag-of-words vocabulary.
  • ...and 3 more figures