PoPreRo: A New Dataset for Popularity Prediction of Romanian Reddit Posts

Ana-Cristina Rogoz, Maria Ilinca Nechita, Radu Tudor Ionescu

TL;DR

PoPreRo is a valuable resource for evaluating models that predict the popularity of social media posts in Romanian. The paper also introduces a set of competitive models to serve as baselines for future research.

Abstract

We introduce PoPreRo, the first dataset for Popularity Prediction of Romanian posts collected from Reddit. The PoPreRo dataset includes a varied compilation of post samples from five distinct subreddits of Romania, totaling 28,107 data samples. Along with our novel dataset, we introduce a set of competitive models to be used as baselines for future research. Interestingly, the top-scoring model achieves an accuracy of 61.35% and a macro F1 score of 60.60% on the test set, indicating that the popularity prediction task on PoPreRo is very challenging. Further investigations based on few-shot prompting the Falcon-7B Large Language Model also point in the same direction. We thus believe that PoPreRo is a valuable resource that can be used to evaluate models on predicting the popularity of social media posts in Romanian. We release our dataset at https://github.com/ana-rogoz/PoPreRo.
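
The abstract reports both accuracy and macro F1 for the binary popular/unpopular task. As a reading aid, here is a minimal sketch of how the two metrics are commonly computed. The function names and the toy labels are illustrative assumptions; this is not the authors' evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores.

    Assumption: label 1 stands for "popular" and label 0 for "unpopular".
    """
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example (made-up labels): a model that always predicts "popular".
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Because macro F1 averages the classes with equal weight, a model that neglects one class is penalized more heavily than accuracy alone would suggest, which is why the two numbers are typically reported together.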

Paper Structure (20 sections, 1 figure, 6 tables)

Figures (1)

  • Figure 1: Number of samples (#posts) for each label (popular/unpopular), distributed by the time of posting. The 24 hours in a day are divided into six four-hour intervals. Best viewed in color.
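
The binning described in the Figure 1 caption can be sketched as follows. This is a hypothetical illustration: the caption does not state where the intervals begin, so the sketch assumes they start at midnight, and the helper names and the `(hour, label)` input format are assumptions as well.

```python
from collections import Counter


def four_hour_bin(hour):
    """Map an hour of the day (0-23) to one of six four-hour intervals.

    Assumption: intervals start at midnight (00-04, 04-08, ..., 20-24).
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    start = (hour // 4) * 4
    return f"{start:02d}-{start + 4:02d}"


def count_by_bin(posts):
    """Count posts per (interval, label) pair.

    `posts` is an iterable of (hour, label) tuples, a hypothetical format.
    """
    counts = Counter()
    for hour, label in posts:
        counts[(four_hour_bin(hour), label)] += 1
    return counts


# Toy input with made-up posting hours and labels.
sample = [(1, "popular"), (3, "popular"), (5, "unpopular"), (23, "unpopular")]
counts = count_by_bin(sample)
```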