Fast Factorized Learning: Powered by In-Memory Database Systems
Bernhard Stöckl, Maximilian E. Schüle
TL;DR
The paper tackles redundant computation when training over join-derived data by learning over factorized joins and pre-computing shared cofactors. It implements in-database factorized learning for linear regression and benchmarks it on disk-based PostgreSQL and in-memory HyPer. Key contributions include an open-source implementation, a cofactor-based gradient descent formulation, and an evaluation on the Favorita dataset. On in-memory systems, the evaluation reports a performance gain of 70% for factorized over non-factorized learning, and a factor of 100 compared to disk-based systems. The approach shows how modern in-memory database systems can accelerate the ML pipeline by pre-computing aggregates prior to data extraction, with potential extensions to broader models such as polynomial regression.
Abstract
Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain of computing cofactors on traditional disk-based database systems. Because no code was published, those experiments could not be reproduced on in-memory database systems. This work describes an implementation of in-database factorized learning based on cofactors. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL, a disk-based database system, and HyPer, an in-memory engine. The evaluation shows that factorized learning on in-memory database systems achieves a performance gain of 70% over non-factorized learning and a factor of 100 compared to disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
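To make the role of cofactors concrete, the following minimal Python sketch shows one common way to phrase gradient descent for linear regression in terms of pre-computed aggregates. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the learning rate are hypothetical, and the cofactors are accumulated over already-materialized rows rather than over a factorized join. The point it illustrates is that once the cofactor matrix is available, each gradient step no longer touches the data.

```python
# Each row is z = (1, x_1, ..., x_d, y): intercept term, features, label.
# Toy data (hypothetical): y = 2 + 3 * x.
rows = [(1.0, 0.0, 2.0), (1.0, 1.0, 5.0), (1.0, 2.0, 8.0), (1.0, 3.0, 11.0)]


def cofactors(rows):
    """Aggregate C[j][k] = sum over all rows of z[j] * z[k] in one pass."""
    n = len(rows[0])
    C = [[0.0] * n for _ in range(n)]
    for z in rows:
        for j in range(n):
            for k in range(n):
                C[j][k] += z[j] * z[k]
    return C


def gradient_from_cofactors(C, theta, m):
    """Gradient of the mean squared error (with factor 1/2) using only C.

    The label column is treated as a feature with fixed coefficient -1,
    so the residual theta . x - y becomes a dot product over all columns.
    """
    params = list(theta) + [-1.0]
    return [
        sum(params[k] * C[j][k] for k in range(len(params))) / m
        for j in range(len(theta))
    ]


def gradient_from_rows(rows, theta):
    """Reference gradient that scans every row (the non-cofactor variant)."""
    m = len(rows)
    grad = [0.0] * len(theta)
    for z in rows:
        residual = sum(t * zk for t, zk in zip(theta, z)) - z[-1]
        for j in range(len(theta)):
            grad[j] += residual * z[j] / m
    return grad


def train(C, m, dims, lr=0.1, steps=2000):
    """Batch gradient descent whose per-step cost is independent of m."""
    theta = [0.0] * dims
    for _ in range(steps):
        g = gradient_from_cofactors(C, theta, m)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta


C = cofactors(rows)                      # one pass over the data
theta = train(C, m=len(rows), dims=2)    # no further data access
```

In this sketch the per-iteration cost depends only on the number of columns, not on the number of rows. In the setting the abstract describes, the aggregation step would instead be pushed into the database system and computed over the factorized join.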
