Leveraging Large Language Models for Fuzzy String Matching in Political Science

Yu Wang

TL;DR

This letter proposes using large language models to sidestep the limitations of fuzzy string matching entirely. The proposed methods improve on the state of the art by as much as 39% in average precision and are substantially easier and more intuitive for political scientists to use.

Abstract

Fuzzy string matching remains a key issue when political scientists combine data from different sources. Existing matching methods invariably rely on string distances, such as Levenshtein distance and cosine similarity. As such, they are inherently incapable of matching strings that refer to the same entity under different names, such as "JP Morgan" and "Chase Bank", "DPRK" and "North Korea", or "Chuck Fleischmann (R)" and "Charles Fleischmann (R)". In this letter, we propose using large language models to sidestep this problem entirely in an easy and intuitive manner. Extensive experiments show that our proposed methods improve on the state of the art by as much as 39% in average precision while being substantially easier and more intuitive for political scientists to use. Moreover, our results are robust across temperature settings. We further note that enhanced prompting can lead to additional performance improvements.
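To illustrate why distance-based matching struggles with such pairs, the following is a minimal sketch of Levenshtein distance applied to the three examples from the abstract. The `similarity` normalization (one minus distance divided by the longer string's length) is an assumption made for illustration, not necessarily the scoring used in the paper's baselines.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative normalization to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# The three pairs named in the abstract.
pairs = [
    ("JP Morgan", "Chase Bank"),
    ("DPRK", "North Korea"),
    ("Chuck Fleischmann (R)", "Charles Fleischmann (R)"),
]

for left, right in pairs:
    print(f"{left!r} vs {right!r}: similarity = {similarity(left, right):.2f}")
```

Because the score depends only on shared characters, an acronym and its expansion such as "DPRK" and "North Korea" necessarily score low, while pairs that happen to share a long suffix score higher regardless of whether they denote the same entity.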

Paper Structure (11 sections, 2 figures, 1 table)

Introduction
Results
Discussion
Materials and methods
Data availability
Ethical approval

Figures (2)

Figure 1: ChatGPT substantially outperforms character-based matching methods. Zero-shot ChatGPT, at temperatures 0.2 and 1.0, outperforms character-based methods by a large margin.
Figure 2: ChatGPT again outperforms character-based matching methods. In particular, when we provide more context in the prompt (p2), zero-shot ChatGPT, at temperatures 0.2 and 1.0, outperforms character-based methods by a large margin and achieves 100% precision.
