
Methods for Matching English Language Addresses

Keshav Ramani, Daniel Borrajo

TL;DR

This work tackles English address matching by constructing a robust synthetic dataset that captures real-world address variations and noise. It introduces a comprehensive data-generation pipeline (base addresses, prefixes, and matching/mismatching transformations) and evaluates a range of matching approaches, from traditional string-distance baselines to an ESIM-based deep model with optional character embeddings. The ESIM+Char model achieves the best overall accuracy, aided by segmentation and character-aware representations, though training is more resource-intensive and may face generalization challenges across domains. The study provides a practical framework for dataset creation and a thorough comparison that informs the selection of methods for address matching in applications like mail routing and entity resolution. Future work includes integrating transformer-based embeddings (e.g., BERT) and enhancing realism and granularity in data generation.

Abstract

Addresses occupy a niche within the landscape of textual data, owing to the positional importance carried by every word and the geographical scope each address refers to. The task of matching addresses happens every day and is present in various fields such as mail redirection and entity resolution. Our work defines and formalizes a framework to generate matching and mismatching pairs of addresses in the English language, and uses it to evaluate various methods for automatic address matching. These methods range widely, from distance-based approaches to deep learning models. By studying the Precision, Recall, and Accuracy of these approaches, we obtain an understanding of the best-suited method for this setting of the address matching task.
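To make the evaluation setup concrete, here is a minimal sketch of a distance-based baseline scored with Precision, Recall, and Accuracy. It is not the authors' implementation. The similarity measure (`difflib.SequenceMatcher`), the `0.8` threshold, and the function names are all assumptions chosen for illustration, standing in for whichever string distances the paper evaluates.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] after lowercasing and collapsing whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def predict_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Declare a match when similarity reaches the (assumed) threshold."""
    return similarity(a, b) >= threshold


def evaluate(pairs, threshold: float = 0.8) -> dict:
    """Score labeled pairs given as (address_a, address_b, is_match)."""
    tp = fp = tn = fn = 0
    for a, b, label in pairs:
        pred = predict_match(a, b, threshold)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        # Guard against division by zero when a class is absent.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

In practice the threshold would be tuned on held-out pairs, since a single cut-off trades precision against recall.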

Paper Structure (22 sections, 1 equation, 5 figures, 7 tables)


Figures (5)

  • Figure 1: Examples of matching address generation with randomly chosen transformations.
  • Figure 2: Examples of mismatching address generation with randomly chosen transformations.
  • Figure 3: Word vector computation. The words in an address are passed through a GloVe embedding layer, and its outputs are passed to a Bi-LSTM layer. As Chen et al. noted, GloVe (Pennington et al., 2014) can be a good choice of word embeddings for the ESIM model.
  • Figure 4: Character vector computation. The computation of character vectors is very similar to that of word vectors, except that pre-trained embeddings are not used.
  • Figure 5: The modified ESIM architecture. The original version of ESIM formulated by Chen et al. (2017) did not contain the character vectors and worked only with word vectors and embeddings. Dong et al. (2018) later studied the impact of adding character embeddings, as shown in this figure, in the context of next utterance selection in dialogues. We analyze the effectiveness of the ESIM model with and without the character embeddings for the task of address matching. The computations of word vectors and character vectors are shown in Figure 3 and Figure 4.
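
The pair generation illustrated in Figures 1 and 2 can be sketched roughly as follows. This is only an illustration of the idea of applying randomly chosen transformations. The abbreviation table, the three matching transformations, and the house-number change used for mismatches are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd"}


def abbreviate(text: str) -> str:
    """Replace known street-type words with abbreviations, keeping trailing commas."""
    out = []
    for word in text.split():
        core = word.rstrip(",")
        trailing = word[len(core):]
        out.append(ABBREVIATIONS.get(core.lower(), core) + trailing)
    return " ".join(out)


def uppercase(text: str) -> str:
    return text.upper()


def drop_commas(text: str) -> str:
    return text.replace(",", "")


def make_matching(address: str, rng: random.Random) -> str:
    """Apply one randomly chosen transformation that keeps the same location."""
    transform = rng.choice([abbreviate, uppercase, drop_commas])
    return transform(address)


def make_mismatching(address: str, rng: random.Random) -> str:
    """Change the leading house number so the result refers to a different place."""
    tokens = address.split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("expected the address to start with a house number")
    original = int(tokens[0])
    new_number = original
    while new_number == original:
        new_number = rng.randint(1, 999)
    tokens[0] = str(new_number)
    return " ".join(tokens)
```

Passing an explicit `random.Random(seed)` keeps the generated dataset reproducible, which helps when comparing methods on identical pairs.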