Tide: A Customisable Dataset Generator for Anti-Money Laundering Research

Montijn van den Beukel, Jože Martin Rožanec, Ana-Lucia Varbanescu

TL;DR

Tide is an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics, with the goal of advancing the development of robust AML detection methods.

Abstract

The lack of accessible transactional data significantly hinders machine learning research for Anti-Money Laundering (AML). Privacy and legal concerns prevent the sharing of real financial data, while existing synthetic generators focus on simplistic structural patterns and neglect the temporal dynamics (timing and frequency) that characterise sophisticated laundering schemes. We present Tide, an open-source synthetic dataset generator that produces graph-based financial networks incorporating money laundering patterns defined by both structural and temporal characteristics. Tide enables reproducible, customisable dataset generation tailored to specific research needs. We release two reference datasets with different illicit ratios (LI: 0.10%, HI: 0.19%), alongside implementations of state-of-the-art detection models. Evaluation across these datasets reveals condition-dependent model rankings: LightGBM achieves the highest PR-AUC (78.05) in the low illicit ratio condition, while XGBoost performs best (85.12) in the high illicit ratio condition. These divergent rankings demonstrate that the reference datasets can meaningfully differentiate model capabilities across operational conditions. Tide provides the research community with a configurable benchmark that exposes meaningful performance variation across model architectures, advancing the development of robust AML detection methods.

Paper Structure (102 sections, 6 equations, 11 figures, 24 tables)


Figures (11)

  • Figure 1: The iterative research methodology. The process employs an adversarial feedback loop where generator parameters (Phase 2) are recalibrated based on the performance of detection models (Phase 4).
  • Figure 2: The four main steps of the Tide graph generation process: (a) Entity creation and clustering, (b) Entity selection, (c) Transaction sequence generation, and (d) Pattern aggregation.
  • Figure 3: Visualisation of the overseas transfer pattern. Individual_1 owns Account_1.
  • Figure 4: Visualisation of the rapid fund movement pattern. Individual_1 owns Account_1 and Account_2.
  • Figure 5: Visualisation of the front business pattern. Business_1 owns Account_1 and Account_2.
  • ...and 6 more figures