All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media

Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, Animesh Mukherjee

TL;DR

The paper estimates how likely a word is to be borrowed in Hindi–English social media. It defines three Twitter-derived signals (UUR, UTR, UPR) and compares them against a log-frequency-ratio baseline, $\log\left(\frac{F_{L_2}}{F_{L_1}}\right)$. The authors build a large Hindi–English tweet corpus, derive ground-truth borrowing judgments via LPF-based surveys, and show that the proposed metrics reach a Spearman $\rho$ of about $0.62$, more than twice the baseline's $0.26$, with strong performance in younger and low-mixing user groups. When annotators re-annotated foreign words ranked as surely borrowed, they judged in 88% of cases that the foreign-language tag should be replaced by the native-language tag, which supports the practical value of these signals for automatic language tagging. The work points to substantial room for improvement in language identification for multilingual social media and suggests avenues for real-time tagging and broader multilingual NLP tasks.

Abstract

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness compared to the best performing baseline (nearly 0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 percent of cases the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems.
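The evaluation summarized above scores each word, ranks the words, and compares that ranking with a ground-truth ranking using the Spearman correlation coefficient. The following is a minimal sketch of that comparison for the log-frequency-ratio baseline, not the authors' implementation. It assumes that $F_{L_1}$ and $F_{L_2}$ are a word's frequency counts in the two languages; the word names, counts, and survey scores are invented purely for illustration and do not come from the paper. The proposed UUR, UTR, and UPR metrics are not sketched because their definitions are not given here.

```python
import math


def log_frequency_ratio(f_l2, f_l1):
    """Baseline score log(F_L2 / F_L1) for one word (counts must be positive)."""
    return math.log(f_l2 / f_l1)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical toy data: word -> (F_L2, F_L1). Not taken from the paper.
toy_counts = {
    "word_a": (120, 30),
    "word_b": (50, 50),
    "word_c": (10, 200),
    "word_d": (80, 20),
}
# Hypothetical ground-truth borrowing scores, in the same word order.
toy_ground_truth = [0.9, 0.5, 0.1, 0.7]

baseline_scores = [log_frequency_ratio(f2, f1) for f2, f1 in toy_counts.values()]
rho = spearman_rho(baseline_scores, toy_ground_truth)
print(f"Spearman rho on toy data: {rho:.3f}")
```

Any other scoring function can be compared against the same ground truth by swapping it in for `log_frequency_ratio`, which is how a figure such as 0.62 versus 0.26 would be obtained on real data.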

Paper Structure

This paper contains 17 sections, 3 equations, 9 tables.