Celebrity Profiling on Short Urdu Text using Twitter Followers' Feed

Muhammad Hamza, Rizwan Jafar

TL;DR

This work tackles celebrity profiling in Urdu by leveraging short Urdu tweets from followers to predict celebrity demographics (age, gender, occupation, fame). It builds a labeled Urdu corpus through follower data and applies a spectrum of ML and DL models (e.g., SVM, Random Forest, CNN, LSTM) evaluated with accuracy, precision, recall, F1, and cRank. Gender prediction emerges as the most robust task (cRank up to ~0.63 with ML and ~0.79 with DL), while age, occupation, and fame yield more moderate results, partly due to data imbalance and language-specific challenges. The study demonstrates the feasibility of cross-user, follower-based linguistic signals for Urdu demographic profiling and provides a baseline dataset and methodology for low-resource language analytics with potential applications in marketing, forensics, and social media moderation.

Abstract

Social media has become an essential part of the digital age, serving as a platform for communication, interaction, and information sharing. Celebrities are among the most active users and often reveal aspects of their personal and professional lives through online posts. Platforms such as Twitter provide an opportunity to analyze language and behavior to understand demographic and social patterns. Since followers frequently share linguistic traits and interests with the celebrities they follow, textual data from followers can be used to predict celebrity demographics. However, most existing research in this field has focused on English and other high-resource languages, leaving Urdu largely unexplored. This study applies modern machine learning and deep learning techniques to the problem of celebrity profiling in Urdu. A dataset of short Urdu tweets from followers of celebrities from the subcontinent was collected and preprocessed. Multiple algorithms were trained and compared, including Logistic Regression, Support Vector Machines, Random Forests, Convolutional Neural Networks, and Long Short-Term Memory networks. The models were evaluated using accuracy, precision, recall, F1-score, and cumulative rank (cRank). The best performance was achieved for gender prediction with a cRank of 0.65 and an accuracy of 0.65, followed by moderate results for age, profession, and fame prediction. These results demonstrate that follower-based linguistic features can be effectively leveraged using machine learning and neural approaches for demographic prediction in Urdu, a low-resource language.

Paper Structure

This paper contains 28 sections, 2 equations, 11 tables.