On the performance of phonetic algorithms in microtext normalization

Yerai Doval, Manuel Vilares, Jesús Vilares

TL;DR

The paper tackles microtext normalization by evaluating how well a broad set of English phonetic algorithms can generate compact, high-quality candidate lists for standard words. It adopts a two-stage pipeline in which phonetic candidate generation feeds a downstream selection component, and it uses public implementations to compare the algorithms in terms of precision ($P$), recall ($R$), and $F_1$, along with candidate-set size. The study reports that overall $F_1$ scores are modest and that Beider-Morse, Eudex, MRA, Metaphone, StatCan, Soundex, and Roger Root show varied strengths, so the choice of algorithm depends heavily on the downstream candidate selection process and on dictionary size. The findings provide practical guidance for designing microtext normalization systems and point to avenues for improving candidate generation, such as leveraging spell-checkers or language-aware rules and extending the analysis to other languages. Overall, the work benchmarks phonetic algorithms for microtext normalization and lays groundwork for integrating them more effectively into NLP pipelines.

Abstract

User-generated content published on microblogging social networks constitutes a priceless source of information. However, microtexts usually deviate from the standard lexical and grammatical rules of the language, which makes them very difficult for traditional intelligent systems to process. In response, microtext normalization transforms these non-standard microtexts into standard, well-written text as a preprocessing step, allowing traditional approaches to continue with their usual processing. Given the importance of phonetic phenomena in the formation of non-standard text, an essential element of a normalizer's knowledge base is the set of phonetic rules that encode these phenomena, which can be found in so-called phonetic algorithms. In this work we experiment with a wide range of phonetic algorithms for the English language. The aim of this study is to determine the best phonetic algorithms within the context of candidate generation for microtext normalization. In other words, we intend to find those algorithms that, taking as input the non-standard terms to be normalized, produce as output the smallest possible sets of normalization candidates that still contain the corresponding target standard words. As will be shown, the choice of phonetic algorithm depends heavily on the capabilities of the candidate selection mechanism usually found at the end of a microtext normalization pipeline. The faster it can make the right choices among sufficiently large sets of candidates, the more precision we can sacrifice in the phonetic algorithms in favour of coverage in order to increase the overall performance of the normalization system.

KEYWORDS: microtext normalization; phonetic algorithm; fuzzy matching; Twitter; texting
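The candidate generation step described in the abstract can be illustrated with a minimal sketch. It assumes classic American Soundex as the phonetic algorithm and a tiny made-up dictionary. The names `soundex`, `build_index`, `candidates`, and `evaluate` are hypothetical. The micro-averaged precision, recall, and $F_1$ computed here are one plausible reading of the metrics and may not match the paper's exact definitions.

```python
from collections import defaultdict

# Soundex digit classes (classic American Soundex).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex key of `word` ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        if ch not in "hw":  # 'h' and 'w' do not separate repeated codes
            prev = digit
    return (encoded + "000")[:4]


def build_index(dictionary):
    """Group standard words by phonetic key."""
    index = defaultdict(set)
    for word in dictionary:
        index[soundex(word)].add(word)
    return index


def candidates(term, index):
    """Normalization candidates: dictionary words sharing the term's key."""
    return set(index.get(soundex(term), set()))


def evaluate(pairs, index):
    """Micro-averaged P, R, F1 over (non-standard term, target word) pairs.

    Assumed definitions (not taken from the paper): a hit is a candidate
    set that contains the target; P = hits / total candidates generated,
    R = hits / number of terms.
    """
    hits = total = 0
    for term, target in pairs:
        cands = candidates(term, index)
        total += len(cands)
        hits += target in cands
    p = hits / total if total else 0.0
    r = hits / len(pairs) if pairs else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Illustrative data only; not from the paper's corpus.
DICTIONARY = ["tomorrow", "tonight", "people", "pupil",
              "great", "grate", "greet", "because"]
INDEX = build_index(DICTIONARY)
PAIRS = [("tmrw", "tomorrow"), ("grt", "great"), ("ppl", "people")]
P, R, F1 = evaluate(PAIRS, INDEX)
```

In this toy setting, "tmrw" yields a single correct candidate, "grt" yields three candidates that include the target, and "ppl" misses "people" entirely because Soundex collapses the doubled "p". This is the trade-off the abstract describes: smaller candidate sets favour precision, while a capable downstream selector can tolerate larger sets in exchange for coverage.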

Paper Structure (26 sections, 3 equations, 7 tables)