Generalizable prediction of academic performance from short texts on social media

Ivan Smirnov

Abstract

It has already been established that digital traces can be used to predict various human attributes. In most cases, however, predictive models rely on features that are specific to a particular source of digital trace data. In contrast, short texts written by users (tweets, posts, or comments) are ubiquitous across multiple platforms. In this paper, we explore the predictive power of short texts with respect to the academic performance of their authors. We use data from a representative panel of Russian students that includes information about their educational outcomes and activity on a popular networking site, VK. We build a model to predict academic performance from users' posts on VK and then apply it to a different context. In particular, we show that the model could reproduce rankings of schools and universities from the posts of their students on social media. We also find that the same model could predict academic performance from tweets as well as from VK posts. The generalizability of a model trained on a relatively small data set could be explained by the use of continuous word representations trained on a much larger corpus of social media posts. This also allows for greater interpretability of model predictions.

Paper Structure

This paper contains 16 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Pearson correlation between common text features and academic performance. The use of capitalized words, emojis and exclamations (average number per post normalized by the post length in tokens) is negatively correlated with performance. The use of Latin characters, average post and word length, vocabulary size and entropy of users' texts are positively correlated with academic performance.
  • Figure 2: The predictive power of the model depending on the number of posts used per user. First, users with at least 20 posts were selected. Then, for each user, $N$ of their posts were selected to predict their academic performance ($N = 1, ..., 20$). The shaded region corresponds to the bootstrapped 90% confidence interval.
  • Figure 3: Correlation between the predicted and real performance of schools and universities. Pearson's correlation coefficients between predicted school scores and the USE scores of their graduates were computed for Saint-Petersburg (a), Samara (b) and Tomsk (c). The correlation between predicted university scores and the USE scores of their enrollees was also computed for the 100 largest Russian universities (d).
  • Figure 4: Comparison of predictions based on VK and Twitter data. While estimates from Twitter and VK vary for individual universities, the overall performance of the model is similar in both cases. Note that the performance of the model is rather low due to the limited number of users per university for whom both VK and Twitter data is available.
  • Figure 5: t-SNE representation of the words with the highest and lowest scores from the training data set. High performing clusters (orange) include English words and words related to literature, physics, or thinking processes. Low performing clusters (green) include spelling errors and words related to horoscopes, military service, or cars and road accidents.
  • ...and 1 more figure
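
The text features named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a naive whitespace tokenizer, Shannon entropy over a user's lowercased token distribution, and a hand-written Pearson correlation; the paper's actual tokenizer, entropy definition, and emoji and Latin-character handling are not given in this summary. The function names `text_features` and `pearson` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def text_features(posts):
    """Compute simple per-user text features from a list of post strings.

    Assumption: tokens are whitespace-separated; the paper's tokenizer may differ.
    """
    # Keep only posts that contain at least one token.
    pairs = [(p, p.split()) for p in posts]
    pairs = [(p, toks) for p, toks in pairs if toks]
    if not pairs:
        raise ValueError("need at least one non-empty post")
    n_posts = len(pairs)

    # Count per post, normalize by post length in tokens, then average over posts.
    caps = sum(
        sum(1 for t in toks if len(t) > 1 and t.isupper()) / len(toks)
        for _, toks in pairs
    ) / n_posts
    excl = sum(p.count("!") / len(toks) for p, toks in pairs) / n_posts

    # Pool all tokens of the user for vocabulary and entropy statistics.
    all_tokens = [t.lower() for _, toks in pairs for t in toks]
    counts = Counter(all_tokens)
    total = len(all_tokens)
    # Assumption: entropy is Shannon entropy (in bits) of the token distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

    return {
        "caps_per_token": caps,
        "exclamations_per_token": excl,
        "avg_post_length": total / n_posts,
        "avg_word_length": sum(len(t) for t in all_tokens) / total,
        "vocabulary_size": len(counts),
        "entropy_bits": entropy,
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Under these assumptions, one would call `text_features` once per user and then pass a single feature's values across users, together with the users' performance scores, to `pearson` to obtain one bar of a plot like Figure 1.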