MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

Dwip Dalal; Vivek Srivastava; Mayank Singh

TL;DR

MMT is a large-scale multilingual and multi-topic dataset collected from Twitter, covering 13 coarse-grained and 63 fine-grained topics in the Indian context; a subset is annotated with various Indian languages and their code-mixed counterparts.

Abstract

Social media plays a significant role in cross-cultural communication. A vast amount of this communication occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for tasks such as language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale multilingual and multi-topic dataset (MMT) collected from Twitter (1.7 million tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. We also demonstrate that existing tools fail to capture the linguistic diversity in MMT on two downstream tasks: topic modeling and language identification. To facilitate future research, we have made the anonymized and annotated dataset available at https://huggingface.co/datasets/LingoIITGN/MMT.

Paper Structure (13 sections, 2 figures, 5 tables)

Introduction
Constructing The Multilingual and Multi-topic Dataset
MMT
MMT-LID
Dataset Analysis
Answering the Pertinent Questions
RQ1: how do traditional topic modeling tools perform in multilingual settings?
Inferring topics in MMT dataset
Inferring topics in MMT-LID dataset
RQ2: can we achieve better topic modeling with the cross-lingual contextual topic model (CTM)?
RQ3: how do multilingual language identification tools perform in the multi-topical text?
Limitations and Future Works
Concluding Remarks

Figures (2)

Figure 1: Tweets from the MMT-LID dataset with language tags from Twitter and the human annotator.
Figure 2: Distribution of language annotations by human annotators in the MMT-LID dataset. We report the top-5 languages identified by the human annotators. Correct shows the number of tweets whose language Twitter identified correctly. The columns En and Hi show the language identified by Twitter. CMs covers all tweets in code-mixed languages.
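
The comparison summarized in Figure 2 (Twitter's language tag versus the human annotator's tag) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name tally_agreement, the tag strings (including "hi-en" for code-mixed Hindi-English), and the toy pairs are hypothetical and are not taken from the MMT-LID dataset or the authors' code.

```python
from collections import Counter


def tally_agreement(pairs):
    """For each Twitter tag, return (tweets where the human agreed, total tweets).

    `pairs` is an iterable of (twitter_tag, human_tag) tuples.
    """
    correct = Counter()
    total = Counter()
    for twitter_tag, human_tag in pairs:
        total[twitter_tag] += 1
        if twitter_tag == human_tag:
            correct[twitter_tag] += 1
    return {tag: (correct[tag], total[tag]) for tag in total}


# Hypothetical toy data, NOT drawn from MMT-LID: (Twitter tag, human tag).
toy_pairs = [
    ("en", "en"),     # Twitter and annotator agree
    ("en", "hi-en"),  # annotator marks the tweet as code-mixed
    ("hi", "hi"),
    ("hi", "mr"),     # annotator identifies a different language
    ("hi", "hi"),
]

agreement = tally_agreement(toy_pairs)
for tag, (n_correct, n_total) in sorted(agreement.items()):
    print(f"{tag}: {n_correct}/{n_total} tags confirmed by the annotator")
```

Grouping by the Twitter tag mirrors the En and Hi columns described in the caption; a real analysis would read the tag pairs from the released annotations instead of a hard-coded list.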
