QiBERT -- Classifying Online Conversations Messages with BERT as a Feature

Bruno D. Ferreira-Saraiva, Zuil Pirola, João P. Matos-Carvalho, Manuel Marques-Pita

TL;DR

This project used state-of-the-art (SoA) Machine Learning algorithms and methods, built on BERT-based models, to classify whether utterances are in or out of the debate subject, and achieved an average accuracy above 0.95 for classifying online messages.

Abstract

Recent developments in online communication and its use in everyday life have caused an explosion in the amount of a new genre of text data: short text. The need to classify this type of text based on its content therefore has significant implications in many areas. Online debates are no exception, since they provide access to information about the opinions, positions and preferences of their users. This paper uses data obtained from online social conversations in Portuguese schools (short text) to observe behavioural trends and to see whether students remain engaged in the discussion when stimulated. This project used state-of-the-art (SoA) Machine Learning (ML) algorithms and methods, built on BERT-based models, to classify whether utterances are in or out of the debate subject. Using SBERT embeddings as a feature, with supervised learning, the proposed model achieved an average accuracy above 0.95 for classifying online messages. Such improvements can help social scientists better understand human communication, behaviour, discussion and persuasion.
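
The pipeline summarised in the abstract (sentence embeddings used as features for a supervised in-subject / out-of-subject classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 4-dimensional toy vectors stand in for real SBERT embeddings, logistic regression stands in for whichever classifier the paper actually uses, and the names `make_toy_embeddings`, `train_logistic`, `predict` and `accuracy` are hypothetical.

```python
import math
import random


def make_toy_embeddings(n_per_class, seed=0):
    """Return (X, y): synthetic stand-ins for sentence embeddings.

    Assumption: in practice X would come from an SBERT encoder, e.g.
    sentence-transformers' SentenceTransformer(...).encode(messages).
    Label 1 = in the debate subject, 0 = out of the debate subject.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_per_class):
        X.append([rng.gauss(c, 0.1) for c in (1.0, 1.0, 0.0, 0.0)])
        y.append(1)
        X.append([rng.gauss(c, 0.1) for c in (0.0, 0.0, 1.0, 1.0)])
        y.append(0)
    return X, y


def _sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent on log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (in subject) if the linear score is non-negative, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


def accuracy(w, b, X, y):
    """Fraction of embeddings whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


# Train on one synthetic sample and evaluate on a separately seeded one.
X_train, y_train = make_toy_embeddings(50, seed=0)
X_test, y_test = make_toy_embeddings(50, seed=1)
w, b = train_logistic(X_train, y_train)
print("held-out accuracy:", accuracy(w, b, X_test, y_test))
```

The toy clusters are well separated by construction, so the accuracy printed here says nothing about the results reported in the paper; the sketch only shows the shape of the embeddings-as-features approach.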

Paper Structure

This paper contains 10 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Comparison between "Racism" and "Other" subject sentence embeddings.
  • Figure 2: Proposed system architecture.
  • Figure 3: Feature selector method. Adapted from stoppiglia2003ranking.
  • Figure 4: Comparison between Complete (CAg) and Majority (MAg) Agreement.
  • Figure 5: Comparison between different Machine Learning Models: a) without features reduced and b) features reduced.
  • ...and 2 more figures