ML-driven detection and reduction of ballast information in multi-modal datasets

Yaroslav Solovko

TL;DR

This paper introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. The framework reveals distinct ballast typologies and offers practical guidance for leaner, more efficient machine learning pipelines.

Abstract

Modern datasets often contain ballast: redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to diverse datasets to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal loss of, or even an improvement in, classification performance, along with substantial reductions in training time and memory footprint. The framework reveals distinct ballast typologies (e.g. statistical, semantic, infrastructural) and offers practical guidance for leaner, more efficient machine learning pipelines.
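
The abstract does not state how the Ballast Score combines its signals, so the following is only an illustrative sketch of the general idea, not the paper's definition. It assumes discrete features, uses just two of the listed signals (Shannon entropy and mutual information with the label), and averages them with equal weights. The function name `toy_ballast_score`, the normalizations, and the weighting are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of a discrete sequence."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_entropy(values):
    """Entropy divided by its maximum for the observed number of distinct values."""
    k = len(set(values))
    if k <= 1:
        return 0.0  # a constant feature carries no information
    return shannon_entropy(values) / math.log2(k)


def mutual_information(feature, target):
    """Mutual information in bits, computed as H(X) + H(Y) - H(X, Y)."""
    joint = list(zip(feature, target))
    return shannon_entropy(feature) + shannon_entropy(target) - shannon_entropy(joint)


def toy_ballast_score(feature, target):
    """Hypothetical score in [0, 1]; higher means the feature looks more like ballast.

    Assumption: equal-weight average of normalized entropy and normalized
    mutual information, inverted so that uninformative features score high.
    """
    h_target = shannon_entropy(target)
    relevance = mutual_information(feature, target) / h_target if h_target > 0 else 0.0
    relevance = min(max(relevance, 0.0), 1.0)  # guard against rounding drift
    return 1.0 - 0.5 * (normalized_entropy(feature) + relevance)


if __name__ == "__main__":
    labels = [0, 1, 0, 1]
    print(toy_ballast_score([7, 7, 7, 7], labels))  # constant feature
    print(toy_ballast_score([0, 0, 1, 1], labels))  # varies, but independent of the label
    print(toy_ballast_score([0, 1, 0, 1], labels))  # perfectly predictive feature
```

In this toy version a constant column receives the maximum score and a column that fully determines the label receives the minimum. A real pipeline would need to handle continuous features, identifier-like columns (high entropy but no generalizable signal), and the model-based signals named in the abstract, such as Lasso coefficients and SHAP values.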

Paper Structure (37 sections, 12 equations, 27 figures, 10 tables)

Figures (27)

  • Figure 1: Shannon Entropy distribution of numeric features.
  • Figure 2: Mutual Information Scores.
  • Figure 3: PCA Scatterplot of Transaction Data.
  • Figure 4: Top 50 Feature Correlation Heatmap.
  • Figure 5: Feature Overlap Between SHAP and Lasso Selections.
  • ...and 22 more figures