Computationally and Memory-Efficient Robust Predictive Analytics Using Big Data
Daniel Menges, Adil Rasheed
TL;DR
The paper tackles robust predictive analytics under data uncertainties and storage constraints in big data contexts. It presents a pipeline that fuses Robust Principal Component Analysis (RPCA) for denoising, Optimal Sensor Placement (OSP) for information-rich data compression, and Long Short-Term Memory (LSTM) networks trained on low-dimensional measurements to forecast future states. By reconstructing full-dimensional data from a compact sensor subset, the approach achieves robust data cleaning, substantial memory savings, and accelerated training while maintaining predictive accuracy. The methods are demonstrated on real thermal-imaging data of a ship's engine, showing real-time potential for predictive maintenance and operational insight in data-rich industrial settings.
Abstract
In the current data-intensive era, big data has become a significant asset for Artificial Intelligence (AI), serving as a foundation for developing data-driven models and providing insight into various unknown fields. This study addresses the challenges of data uncertainties, storage limitations, and predictive data-driven modeling using big data. We utilize Robust Principal Component Analysis (RPCA) for effective noise reduction and outlier elimination, and Optimal Sensor Placement (OSP) for efficient data compression and storage. The proposed OSP technique compresses data without substantial information loss, thereby reducing storage needs. While RPCA offers an enhanced alternative to traditional Principal Component Analysis (PCA) for high-dimensional data management, this work extends its use to robust, data-driven modeling applicable to huge data sets in real time. For that purpose, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network, are applied to model and predict data based on a low-dimensional subset obtained from OSP, which substantially accelerates the training phase. LSTMs are well suited to capturing long-term dependencies in time series data, which makes them a natural choice for predicting the future states of physical systems from historical data. All presented algorithms are not only described theoretically but also simulated and validated using real thermal imaging data of a ship's engine.
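The denoising and compression stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes RPCA is solved by principal component pursuit with a standard augmented-Lagrangian iteration, and that sensors are chosen by greedy column pivoting on the leading modes (a common choice for sensor placement). The abstract does not specify either detail. The function names (`rpca`, `select_sensors`, `reconstruct`), the parameter defaults, and the synthetic data are all hypothetical. The LSTM forecasting stage is omitted; in this sketch it would be trained on the rows of `compressed`.

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(X, n_iter=1000, tol=1e-7):
    """Split X into a low-rank part L and a sparse outlier part S."""
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n))          # usual weight on the sparse term
    mu = m * n / (4.0 * np.abs(X).sum())    # heuristic penalty parameter
    L, S, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    for _ in range(n_iter):
        L = svd_shrink(X - S + Y / mu, 1.0 / mu)
        S = shrink(X - L + Y / mu, lam / mu)
        residual = X - L - S
        Y = Y + mu * residual               # dual (multiplier) update
        if np.linalg.norm(residual) <= tol * np.linalg.norm(X):
            break
    return L, S

def select_sensors(Psi, p):
    """Pick p rows (pixels) of the basis Psi by greedy column pivoting on Psi.T."""
    A = Psi.T.copy()                        # shape (r, n_pixels)
    chosen = []
    for _ in range(p):
        j = int(np.argmax(np.sum(A**2, axis=0)))   # most energetic remaining pixel
        chosen.append(j)
        q = A[:, j] / np.linalg.norm(A[:, j])
        A -= np.outer(q, q @ A)             # remove that direction from all columns
    return np.array(chosen)

def reconstruct(Psi, sensors, y):
    """Recover full-dimensional states from sensor readings y by least squares."""
    coeffs, *_ = np.linalg.lstsq(Psi[sensors, :], y, rcond=None)
    return Psi @ coeffs

# Synthetic stand-in for image data: rows are pixels, columns are snapshots.
rng = np.random.default_rng(0)
n_pixels, n_snapshots, rank = 100, 120, 2
L_true = rng.standard_normal((n_pixels, rank)) @ rng.standard_normal((rank, n_snapshots))
S_true = np.zeros_like(L_true)
mask = rng.random(L_true.shape) < 0.03      # roughly 3% of entries are outliers
S_true[mask] = rng.uniform(-10, 10, size=mask.sum())
X = L_true + S_true

L, S = rpca(X)                              # denoise
U, _, _ = np.linalg.svd(L, full_matrices=False)
Psi = U[:, :rank]                           # leading spatial modes
sensors = select_sensors(Psi, rank)         # pixel indices to keep
compressed = L[sensors, :]                  # low-dimensional time series to store
L_rec = reconstruct(Psi, sensors, compressed)
print("stored shape:", compressed.shape, "reconstructed shape:", L_rec.shape)
print("relative error:", np.linalg.norm(L_rec - L_true) / np.linalg.norm(L_true))
```

In this sketch only `compressed`, `sensors`, and `Psi` need to be stored, which is where the memory saving comes from. The number of modes and sensors is set to the known rank of the synthetic data, whereas with real measurements it would have to be chosen, for example from the decay of the singular values.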
