Employing Two-Dimensional Word Embedding for Difficult Tabular Data Stream Classification

Paweł Zyblewski

TL;DR

This work addresses the classification of difficult data streams, that is, streams affected by concept drift and class imbalance. It introduces Streaming Super Tabular Machine Learning (SSTML), which encodes each data chunk into 2D images with the STML algorithm and trains a ResNet-18 CNN on them for a single epoch per chunk in a batch-based setting. Across synthetic, semi-synthetic, and real streams, SSTML achieves statistically significant improvements over state-of-the-art ensemble methods while keeping processing time competitive, which demonstrates the viability of deep CNN-based image representations for tabular streams. The study opens avenues for exploring other Multi-Dimensional Encoding (MDE) options, neural architectures, and loss functions to further improve performance in streaming tabular data analysis.

Abstract

Rapid technological advances are inherently linked to a growing volume of data, a substantial portion of which can be interpreted as data streams that may exhibit concept drift and a high imbalance ratio. Consequently, developing new approaches to classifying difficult data streams is a rapidly growing research area. At the same time, the proliferation of deep learning and transfer learning, together with the success of convolutional neural networks in computer vision tasks, has contributed to the emergence of a new research trend, namely Multi-Dimensional Encoding (MDE), which focuses on transforming tabular data into a homogeneous form of a discrete digital signal. This paper proposes Streaming Super Tabular Machine Learning (SSTML), thereby exploring for the first time the potential of MDE in the difficult data stream classification task. SSTML encodes consecutive data chunks into an image representation using the STML algorithm and then performs a single ResNet-18 training epoch. Experiments conducted on synthetic and real data streams demonstrate that SSTML achieves classification quality statistically significantly superior to state-of-the-art algorithms while maintaining comparable processing time.
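
To make the chunk-wise processing described in the abstract more concrete, the following is a minimal sketch under stated assumptions, not the author's implementation. The helper `encode_as_image` is a hypothetical stand-in for an MDE encoder (it only zero-pads and reshapes a feature vector and is not the STML algorithm), and `CentroidModel` is a placeholder for the ResNet-18 network. The test-then-train ordering and the use of balanced accuracy (BAC, the metric named in the figure captions) as the per-chunk score are likewise assumptions made for illustration.

```python
from collections import defaultdict


def encode_as_image(row, side=4):
    """Hypothetical stand-in for an MDE encoder (NOT STML):
    zero-pad a feature vector and reshape it to a side x side grid."""
    padded = list(row) + [0.0] * (side * side - len(row))
    return [padded[i * side:(i + 1) * side] for i in range(side)]


class CentroidModel:
    """Placeholder for the CNN: keeps a running mean image per class."""

    def __init__(self):
        self.sums = {}
        self.counts = defaultdict(int)

    def train_one_epoch(self, images, labels):
        # Exactly one pass over the chunk, mirroring the single-epoch idea.
        for img, y in zip(images, labels):
            flat = [v for r in img for v in r]
            if y not in self.sums:
                self.sums[y] = [0.0] * len(flat)
            self.sums[y] = [s + v for s, v in zip(self.sums[y], flat)]
            self.counts[y] += 1

    def predict(self, images):
        preds = []
        for img in images:
            flat = [v for r in img for v in r]

            def sq_dist(c):
                # Squared distance from the image to the mean image of class c.
                return sum((s / self.counts[c] - v) ** 2
                           for s, v in zip(self.sums[c], flat))

            preds.append(min(self.sums, key=sq_dist))
        return preds


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def run_stream(chunks, model):
    """Assumed test-then-train protocol: score each new chunk, then train on it."""
    scores = []
    for i, (X, y) in enumerate(chunks):
        images = [encode_as_image(row) for row in X]
        if i > 0:  # the first chunk is used for training only
            scores.append(balanced_accuracy(y, model.predict(images)))
        model.train_one_epoch(images, y)
    return scores
```

BAC averages per-class recall, so a model that always predicts the majority class in a two-class imbalanced chunk scores only 0.5, which is why it is a common choice for imbalanced streams. Swapping the placeholder model for a real network would only require an object exposing the same two methods.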

Paper Structure (12 sections, 8 figures, 1 table)

Introduction
Related Works
Classifier ensemble for difficult data stream
Multi-Dimensional Encoding
Neural networks for tabular data stream analysis
Streaming Super Tabular Machine Learning
Experimental Evaluation
Set-up
Experiment 1 -- Synthetic data streams
Experiment 2 -- Semi-synthetic and real data streams
Experiment 3 -- Processing time
Conclusion

Figures (8)

Figure 1: An example of encoding a single instance of a synthetically generated tabular problem into a two-dimensional discrete digital signal using STML and IGTD techniques.
Figure 2: The general scheme of the proposed SSTML approach.
Figure 3: Comparison of SSTML behavior with and without transfer learning. A Gaussian filter was applied for visualization purposes.
Figure 4: Comparison of SSTML with reference methods in terms of BAC on synthetic streams. A Gaussian filter was applied for visualization purposes.
Figure 5: Comparison of SSTML with reference methods in terms of BAC on synthetic streams. A Gaussian filter was applied for visualization purposes.
...and 3 more figures
