Combining Knowledge Graphs and NLP to Analyze Instant Messaging Data in Criminal Investigations

Riccardo Pozzi, Valentina Barbera, Renzo Alva Principe, Davide Giardini, Riccardo Rubini, Matteo Palmonari

TL;DR

The paper tackles the challenge of analyzing rich, heterogeneous instant messaging data in criminal investigations by coupling knowledge graphs with NLP to semantically enrich chat data and support investigators. The architecture combines a Neo4j-based knowledge graph, Whisper-driven audio transcriptions, and the NEEL NLP pipeline for entity extraction and linking. Two interfaces sit on top of the enriched data: graph querying and visualization (Neo4j UI) and semantic search (DAVE). Applied to real data from two investigations, the approach improved search and exploration, and investigators gave positive feedback while pointing out limitations, in particular the need for domain-specific fine-tuning and broader multimedia support. The work emphasizes human-in-the-loop verification, traceability of evidence, and secure, compliant data handling, and it outlines future directions such as image/video enrichment and conversational search interfaces for prosecutors and investigators.

Abstract

Criminal investigations often involve the analysis of messages exchanged through instant messaging apps such as WhatsApp, which can be extremely labor-intensive. Our approach integrates knowledge graphs and NLP models to support this analysis by semantically enriching data collected from suspects' mobile phones, helping prosecutors and investigators search the data and gain valuable insights. Our semantic enrichment process involves extracting message data and modeling it using a knowledge graph, generating transcriptions of voice messages, and annotating the data using an end-to-end entity extraction approach. We adopt two different solutions to help users get insights into the data, one based on querying and visualizing the graph, and one based on semantic search. The proposed approach ensures that users can verify the information by accessing the original data. While we report on early results and prototypes developed in the context of an ongoing project, our proposal has been applied in practice to real investigation data. As a consequence, we had the chance to interact closely with prosecutors, collecting positive feedback but also identifying interesting opportunities as well as promising research directions to share with the research community.
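
As a rough illustration of the enrichment idea described in the abstract, the sketch below models extracted messages as a small in-memory property graph and keeps a pointer back to the original record so that a result can be verified against the source data. It is not the authors' schema or code: the node labels (`Person`, `Chat`, `Message`), the relationship names (`SENT`, `IN_CHAT`), the field names, and the sample records are all assumptions made for this example, and a plain Python structure stands in for Neo4j.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory property graph: nodes keyed by id, edges as triples."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, rel, dst):
        self.edges.append((src, rel, dst))


def add_message(graph, msg):
    """Add one extracted message, keeping a pointer to the original record."""
    sender_id = f"person:{msg['sender']}"
    chat_id = f"chat:{msg['chat']}"
    msg_id = f"message:{msg['id']}"
    graph.add_node(sender_id, "Person", phone=msg["sender"])
    graph.add_node(chat_id, "Chat", name=msg["chat"])
    graph.add_node(
        msg_id,
        "Message",
        # Voice messages carry a transcription instead of typed text.
        text=msg.get("text") or msg.get("transcription", ""),
        is_transcription="transcription" in msg,
        source=msg["source"],  # hypothetical locator of the original data
    )
    graph.add_edge(sender_id, "SENT", msg_id)
    graph.add_edge(msg_id, "IN_CHAT", chat_id)
    return msg_id


def messages_sent_by(graph, phone):
    """Return message nodes sent by a phone number (stand-in for a graph query)."""
    sender_id = f"person:{phone}"
    return [graph.nodes[dst] for src, rel, dst in graph.edges
            if src == sender_id and rel == "SENT"]


# Made-up sample records, not real investigation data.
graph = Graph()
add_message(graph, {"id": 1, "sender": "+10000000001", "chat": "group-a",
                    "text": "see you in Boston", "source": "export.db#row=1"})
add_message(graph, {"id": 2, "sender": "+10000000001", "chat": "group-a",
                    "transcription": "call steve tomorrow",
                    "source": "audio/0002.opus"})
for node in messages_sent_by(graph, "+10000000001"):
    print(node["text"], "<-", node["source"])
```

Storing the `source` locator on every message node is one simple way to support the verification requirement: any query result can be traced back to the original text or audio file.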

Paper Structure

This paper contains 12 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the proposed solution.
  • Figure 2: Graph schema.
  • Figure 3: The entity extraction pipeline on chat messages. NER identifies mentions of entities, NEL identifies candidate entities from the knowledge graph (KG). NIL prediction detects that steve brown, steve, and Tom are linked to wrong entities and assumes they are not in the KG (NIL), while Boston is linked to the KG. Entity clustering assigns steve brown and steve to the same cluster (NIL-1), as they refer to the same entity, and Tom to another cluster (NIL-2).
  • Figure 4: Analysis of the correspondences between Person Under Investigation and a second individual. We provide the translation for relationships in Italian.
  • Figure 5:
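
The last two stages named in the Figure 3 caption, NIL prediction and entity clustering, can be illustrated with a toy sketch. It assumes the earlier stages (NER and NEL) have already produced a candidate entity and a linking score for each mention. The scores, the candidate identifiers, the 0.5 threshold, and the token-overlap clustering rule are all assumptions chosen for this example; they stand in for the learned models of the actual pipeline, which the caption does not describe in detail.

```python
def predict_nil(mentions, threshold=0.5):
    """Split mentions into linked ones and NIL ones using a score threshold."""
    linked, nil = {}, []
    for m in mentions:
        if m["score"] >= threshold:
            linked[m["text"]] = m["candidate"]
        else:
            nil.append(m["text"])
    return linked, nil


def cluster_nil(nil_mentions):
    """Greedy clustering: a mention joins the first cluster sharing a token."""
    clusters = []  # list of (token_set, [mentions])
    for text in nil_mentions:
        tokens = set(text.lower().split())
        for cluster_tokens, members in clusters:
            if tokens & cluster_tokens:
                cluster_tokens |= tokens  # in-place update of the cluster's tokens
                members.append(text)
                break
        else:
            # No existing cluster shares a token: start a new one.
            clusters.append((tokens, [text]))
    return {f"NIL-{i}": members for i, (_, members) in enumerate(clusters, 1)}


# Hypothetical NEL output for the mentions in the Figure 3 caption.
mentions = [
    {"text": "steve brown", "candidate": "KG:person/17", "score": 0.21},
    {"text": "steve", "candidate": "KG:person/17", "score": 0.18},
    {"text": "Tom", "candidate": "KG:person/42", "score": 0.30},
    {"text": "Boston", "candidate": "KG:city/boston", "score": 0.93},
]
linked, nil = predict_nil(mentions)
clusters = cluster_nil(nil)
print(linked)
print(clusters)
```

With these made-up scores, only Boston stays linked to the KG, while steve brown and steve fall into one NIL cluster and Tom into another, mirroring the outcome described in the caption. Token overlap is a deliberately crude rule and would merge unrelated people who share a first name.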