LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Abhishek Arora, Melissa Dell
TL;DR
The paper addresses the gap between the accessible string-matching tools that dominate practice and the more powerful transformer language models available for record linkage in social science contexts. It reframes linkage as a text-retrieval problem: records are compared by semantic similarity and matched through k-nearest-neighbor (kNN) search with FAISS. The package offers an off-the-shelf toolkit, pre-trained multilingual models, and easy integration with Hugging Face or OpenAI models. Key contributions include a comprehensive model zoo, efficient retrieval-based APIs, and straightforward customization pipelines that train with contrastive losses on either positive pairs alone or positive and negative pairs, all supported by a model hub for reproducibility. Empirically, custom-tuned LinkTransformer models consistently outperform Levenshtein distance and off-the-shelf semantic models across multilingual and historical datasets, showing strong applicability to noisy, multi-field, and cross-lingual linkage, with potential for near-human accuracy in challenging cases. The work broadens access to advanced record-linkage methods for researchers with limited deep learning expertise, enabling scalable, reproducible, and multilingual analyses across social science and government domains.
Abstract
Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to many languages. Our open-source package LinkTransformer aims to extend the familiarity and ease of use of popular string matching methods to deep learning. It is a general-purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
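
To make the "record linkage as text retrieval" framing concrete, the following is a minimal sketch and not the LinkTransformer API. It assumes a stand-in embedding built from character trigram counts in place of a transformer model, and a brute-force cosine-similarity search in place of a FAISS index. The function names `embed`, `cosine`, and `link` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Stand-in embedding: a bag of character n-grams (NOT a transformer)."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(queries, corpus, k=1):
    """For each query record, retrieve its k most similar corpus records."""
    corpus_vecs = [embed(record) for record in corpus]
    results = {}
    for query in queries:
        query_vec = embed(query)
        # Brute-force search; an approximate index would replace this at scale.
        scored = sorted(
            ((cosine(query_vec, vec), record)
             for vec, record in zip(corpus_vecs, corpus)),
            reverse=True,
        )
        results[query] = [(record, score) for score, record in scored[:k]]
    return results


corpus = ["Toyota Motor Corporation", "Mitsubishi Heavy Industries", "Mitsui Bank"]
queries = ["Toyota Motor Corp", "Mitsubishi Heavy Inds."]
for query, neighbors in link(queries, corpus).items():
    print(query, "->", neighbors[0][0])
```

Swapping `embed` for a semantic similarity model changes what counts as "near" (for example, abbreviations or translations rather than shared characters), while the retrieval step keeps the same shape: embed both sides, then take the nearest neighbors.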
