Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains

González Trigueros Jesús Eduardo, Alonso Sánchez Alejandro, Muñoz Rivera Emilio, Peñarán Prieto Mariana Jaqueline, Mendoza González Camila Natalia

TL;DR

The paper investigates how Variety in Big Data shapes classification across structured and unstructured domains, introducing a dual-domain experimental framework with Epsilon (dense tabular) and Rest-Mex/IMDB (textual) pipelines. It compares extensive hyperparameter optimization methods in the structured setting and scalable, Spark-based processing in the unstructured setting, revealing a complexity paradox where simple, well-regularized linear models can outperform deep architectures in high-dimensional spaces, while robust feature engineering enables effective generalization in text-heavy, distributed contexts. A unified framework for algorithm selection based on data nature and infrastructure is proposed, with practical implications for Green AI and scalable evaluation. The study emphasizes that model performance in heterogeneous Big Data arises more from data representation and optimization feasibility than from architectural complexity alone, guiding future work toward hybrid pipelines and distributed hyperparameter optimization.

Abstract

This study analyzes the impact of heterogeneity ("Variety") in Big Data by comparing classification strategies across structured (Epsilon) and unstructured (Rest-Mex, IMDB) domains. A dual methodology was implemented: evolutionary and Bayesian hyperparameter optimization (Genetic Algorithms, Optuna) in Python for numerical data, and distributed processing in Apache Spark for massive textual corpora. The results reveal a "complexity paradox": in high-dimensional spaces, optimized linear models (SVM, Logistic Regression) outperformed deep architectures and Gradient Boosting. Conversely, in text-based domains, the constraints of distributed fine-tuning led to overfitting in complex models, whereas robust feature engineering, specifically Transformer-based embeddings (RoBERTa) and Bayesian Target Encoding, enabled simpler models to generalize effectively. This work provides a unified framework for algorithm selection based on data nature and infrastructure constraints.

Paper Structure

This paper contains 29 sections, 19 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Confusion Matrix of the Final Ensemble on Epsilon. The high true positive and true negative rates (main diagonal) demonstrate the generalization capability of the combined system.