Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain

Paweł Zyblewski; Jakub Klikowski; Weronika Borek-Marciniec; Paweł Ksieniewicz

TL;DR

The paper tackles real-time classification of textual data streams, focusing on fake-news detection under dynamic class imbalance. Data streams are partitioned into chunks $DS^T_k$ of size $N$, which are encoded into 2D discrete signals $DS^I_k$ used as CNN inputs. Streaming Sentence Space (SSS) is evaluated with a ResNet-18 classifier in a batch-based Test-Then-Train setting on the Fakeddit dataset, where it shows strong generalization. SSS outperforms state-of-the-art data-stream ensembles in balanced accuracy and exhibits lower time complexity, illustrating the viability of deep learning (DL) for data streams and guiding future work on sentence-space encodings and multimodal streams.

Abstract

Tabular data is considered the last unconquered castle of deep learning, yet the task of data stream classification is stated to be an equally important and demanding research area. Due to the temporal constraints, it is assumed that deep learning methods are not the optimal solution for application in this field. However, excluding the entire -- and prevalent -- group of methods seems rather rash given the progress that has been made in recent years in its development. For this reason, the following paper is the first to present an approach to natural language data stream classification using the sentence space method, which allows for encoding text into the form of a discrete digital signal. This allows the use of convolutional deep networks dedicated to image classification to solve the task of recognizing fake news based on text data. Based on the real-life Fakeddit dataset, the proposed approach was compared with state-of-the-art algorithms for data stream classification based on generalization ability and time complexity.

Paper Structure (14 sections, 7 figures)

Introduction
Related works
Text data extraction methods
Multi-Dimensional Encoding of text data
Classifier ensemble for imbalanced data stream
Streaming Sentence Space
Experimental Evaluation
Set-up
Experiment scenarios
Experiment 1 -- Extraction methods
Experiment 2 -- Comparison with data stream classification algorithms
Experiment 3 -- Time complexity
Conclusions
Acknowledgments.

Figures (7)

Figure 1: The general scheme of the proposed SSS approach.
Figure 2: Results of preliminary experiments related to image size and transfer learning.
Figure 3: Changes in the prior class probabilities over time.
Figure 4: Results of an experiment to determine the best extraction method for SSS.
Figure 5: Comparison of SSS with reference methods depending on the extraction method used.
...and 2 more figures
