Methods for Matching English Language Addresses
Keshav Ramani, Daniel Borrajo
TL;DR
This work tackles English address matching by constructing a robust synthetic dataset that captures real-world address variations and noise. It introduces a comprehensive data-generation pipeline (base addresses, prefixes, and matching/mismatching transformations) and evaluates a range of matching approaches, from traditional string-distance baselines to an ESIM-based deep model with optional character embeddings. The ESIM+Char model achieves the best overall accuracy, aided by segmentation and character-aware representations, though it is more resource-intensive to train and may face generalization challenges across domains. The study provides a practical framework for dataset creation and a thorough comparison that informs the selection of methods for address matching in applications like mail routing and entity resolution. Future work includes integrating transformer-based embeddings (e.g., BERT) and enhancing realism and granularity in data generation.
Abstract
Addresses occupy a niche within the landscape of textual data because every word carries positional importance and each address refers to a specific geographical scope. Address matching is performed every day in fields such as mail redirection and entity resolution. Our work defines and formalizes a framework for generating matching and mismatching pairs of English-language addresses, and uses it to evaluate various methods for automatic address matching. These methods range from distance-based approaches to deep learning models. By studying the Precision, Recall, and Accuracy of these approaches, we gain an understanding of which method is best suited to this setting of the address matching task.
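
To make the evaluation setup concrete, the following is a minimal sketch of how a distance-based matcher could be scored with Precision, Recall, and Accuracy on labeled address pairs. It is an illustration under stated assumptions, not the paper's implementation: the similarity measure (Python's `difflib.SequenceMatcher` ratio), the `normalize` helper, and the 0.8 threshold are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Lowercase, turn commas into spaces, and collapse whitespace (assumed preprocessing)."""
    return " ".join(address.lower().replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib; a stand-in for any string-distance measure."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def predict_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Declare a match when similarity reaches the (hypothetical) threshold."""
    return similarity(a, b) >= threshold


def evaluate(pairs, threshold: float = 0.8) -> dict:
    """Score (address_a, address_b, is_match) triples with precision, recall, accuracy."""
    tp = fp = tn = fn = 0
    for a, b, label in pairs:
        pred = predict_match(a, b, threshold)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

In practice the threshold would be tuned on held-out pairs, and the same `evaluate` routine could score any matcher that returns a boolean decision, which keeps distance-based and learned models directly comparable.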
