OpenSanctions Pairs: Large-Scale Entity Matching with LLMs

Chandler Smith, Magnus Sesodia, Friedrich Lindenberg, Christian Schroeder de Witt

Abstract

We release OpenSanctions Pairs, a large-scale entity matching benchmark derived from real-world international sanctions aggregation and analyst deduplication. The dataset contains 755,540 labeled pairs spanning 293 heterogeneous sources across 31 countries, with multilingual and cross-script names, noisy and missing attributes, and set-valued fields typical of compliance workflows. We benchmark a production rule-based matcher (nomenklatura RegressionV1 algorithm) against open- and closed-source LLMs in zero- and few-shot settings. Off-the-shelf LLMs substantially outperform the production rule-based baseline (91.33% F1), reaching up to 98.95% F1 (GPT-4o) and 98.23% F1 with a locally deployable open model (DeepSeek-R1-Distill-Qwen-14B). DSPy MIPROv2 prompt optimization yields consistent but modest gains, while adding in-context examples provides little additional benefit and can degrade performance. Error analysis shows complementary failure modes: the rule-based system over-matches (high false positives), whereas LLMs primarily fail on cross-script transliteration and minor identifier/date inconsistencies. These results indicate that pairwise matching performance is approaching a practical ceiling in this setting, and motivate shifting effort toward pipeline components such as blocking, clustering, and uncertainty-aware review. Code available at https://github.com/chansmi/OSINT_entity_resolution

Paper Structure (33 sections, 1 equation, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Example entity pair illustrating the difficulty of pairwise matching. These two individuals on Pakistan's proscribed persons list share the exact same name, country, and sanctions program, yet they are different people with distinct national ID numbers and fathers. The Nomenklatura baseline assigned this pair a 0.98 match score, near the maximum possible, demonstrating how name-based matching fails for common names without unique identifiers.
  • Figure 2: Prompt comparison showing the full manual prompt (left, 170 words) versus the MIPROv2-optimized prompt (right, 50 words). The optimizer discovered a more concise framing while preserving the conflict-detection principle.
  • Figure 3: Unoptimized (0-shot) F1 score versus model release date for all evaluated models. The dashed grey line indicates the rule-based Nomenklatura baseline (91.3% F1); the blue dashed trendline is a linear fit across all LLMs. Over three years, LLM performance has risen from 92–94% F1 (early 2023) to 98–99% F1 (late 2024–2025), widening the gap over the static rule-based system from roughly 2 F1 points to over 7.
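
The failure mode described in Figure 1 can be illustrated with a minimal sketch. This is not the nomenklatura RegressionV1 algorithm and not the paper's prompt; the record layout, the `national_ids` field, the placeholder values, and the 0.9 threshold are all assumptions made for illustration. It shows only that a name-only score is maximal for two records with identical names, while a simple identifier-conflict check separates them.

```python
from difflib import SequenceMatcher


def name_score(a: dict, b: dict) -> float:
    """Name-only similarity in [0, 1]; every other attribute is ignored."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def has_identifier_conflict(a: dict, b: dict) -> bool:
    """True when both records carry national IDs and none are shared."""
    ids_a = set(a.get("national_ids", []))
    ids_b = set(b.get("national_ids", []))
    return bool(ids_a) and bool(ids_b) and ids_a.isdisjoint(ids_b)


def is_match(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Accept a pair only if names agree and no identifier conflict exists."""
    return name_score(a, b) >= threshold and not has_identifier_conflict(a, b)


# Hypothetical placeholder records (not taken from the dataset):
# same name and country, different national ID numbers.
left = {"name": "Example Person", "country": "pk", "national_ids": ["ID-0001"]}
right = {"name": "Example Person", "country": "pk", "national_ids": ["ID-0002"]}

# A third hypothetical record with no identifier at all.
sparse = {"name": "Example Person", "country": "pk"}

print(name_score(left, right))               # name-only score
print(has_identifier_conflict(left, right))  # conflicting identifiers
print(is_match(left, right))                 # rejected despite identical names
print(is_match(left, sparse))                # no conflict can be detected
```

When one record lacks an identifier, as with `sparse`, the conflict check cannot fire and the pair falls back to the name score, which is the situation the caption describes for common names without unique identifiers.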