Fast Factorized Learning: Powered by In-Memory Database Systems

Bernhard Stöckl, Maximilian E. Schüle

TL;DR

The paper tackles redundant computation during training over join-derived data by adopting factorized joins and precomputing cofactors. It implements in-database factorized learning for linear regression and compares disk-based PostgreSQL with in-memory HyPer, showing substantial speedups on memory-resident engines. Key contributions include an open-source implementation, a cofactor-based gradient descent formulation, and a detailed evaluation on the Favorita dataset demonstrating notable performance gains in HyPer. The approach highlights how modern in-memory DBs can accelerate the ML pipeline by precomputing aggregates prior to data extraction, with potential extensions to broader models like polynomial regression.

Abstract

Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain when computing cofactors on traditional disk-based database systems. Due to the absence of published code, the experiments could not be reproduced on in-memory database systems. This work describes an implementation of in-database factorized learning that uses cofactors. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL -- as a disk-based database system -- and HyPer -- as an in-memory engine. The evaluation shows that factorized learning on in-memory database systems achieves a performance gain of 70% over non-factorized learning and a factor of 100 over disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.
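To make the cofactor idea concrete, the following is a minimal sketch of one common way to train linear regression from pre-computed aggregates; it is not necessarily the paper's exact formulation. The toy rows, the function names cofactors and train, and the step size are assumptions made for illustration. The data is scanned once to build the sums of pairwise products, and every gradient step afterwards reads only that small matrix.

```python
# Sketch: least-squares gradient descent driven by precomputed cofactors.
# Data, names, and hyperparameters are illustrative assumptions.

def cofactors(rows):
    """Single pass over the data: C[j][k] = sum of z[j] * z[k] over all rows,
    where z = (1, x_1, ..., x_n, y) and the leading 1 models the intercept."""
    dim = len(rows[0]) + 1
    C = [[0.0] * dim for _ in range(dim)]
    for row in rows:
        z = (1.0,) + tuple(row)
        for j in range(dim):
            for k in range(dim):
                C[j][k] += z[j] * z[k]
    return C

def train(C, alpha=0.05, steps=5000):
    """Gradient descent that touches only the cofactor matrix, never the rows."""
    n = len(C) - 1   # number of parameters (intercept + features); index n is y
    m = C[0][0]      # row count, since C[0][0] = SUM(1 * 1)
    theta = [0.0] * n
    for _ in range(steps):
        # d/d theta_j of the mean squared error (up to a constant factor):
        # (sum_k theta_k * SUM(x_j * x_k) - SUM(x_j * y)) / m
        grad = [
            (sum(theta[k] * C[j][k] for k in range(n)) - C[j][n]) / m
            for j in range(n)
        ]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy rows (x1, x2, y) generated from y = 1 + 2*x1 + 3*x2.
rows = [
    (1.0, 2.0, 9.0),
    (2.0, 1.0, 8.0),
    (3.0, 3.0, 16.0),
    (0.0, 1.0, 4.0),
    (2.0, 0.0, 5.0),
]
theta = train(cofactors(rows))
```

The cost of each iteration depends on the number of features only, not on the number of rows, which is why computing the aggregates inside the database before data extraction can pay off.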

Paper Structure

This paper contains 21 sections, 26 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: $(a)$ Relations: Sales(Product, Sale), Inventory(Location, Product, Inventory), Competition(Location, Competitor), $(b)$ Hypergraph of the natural join, $(c)$ Variable order of the natural join, $(d)$ Factorized join over the given schema (taken from factor)
  • Figure 2: Counting all elements from the join in Figure 1
  • Figure 3: Computing $SUM(Sale\cdot Competitor)$
  • Figure 4: Gradient descent with two variables of varying scales
  • Figure 5: UML diagram showing the structure of the class $ExtendedVariableOrder$, its inner class $nameKey$ and the struct $scaleFactors$
  • ...and 4 more figures
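
The aggregates named in Figures 2 and 3 can be sketched on the schema from Figure 1. The relation contents below are hypothetical toy data, and the function names are assumptions. The sketch computes COUNT and SUM(Sale·Competitor) from per-key partial aggregates, then checks them against a version that enumerates every join tuple.

```python
from collections import defaultdict

# Hypothetical toy instances of the relations from Figure 1.
sales = [("p1", 10.0), ("p1", 20.0), ("p2", 5.0)]        # (Product, Sale)
inventory = [("l1", "p1", 3.0), ("l1", "p2", 4.0),
             ("l2", "p1", 1.0)]                          # (Location, Product, Inventory)
competition = [("l1", 2.0), ("l1", 7.0), ("l2", 1.0)]    # (Location, Competitor)

def factorized_aggregates(sales, inventory, competition):
    """COUNT(*) and SUM(Sale * Competitor) over the natural join,
    computed from per-key partial aggregates instead of join tuples."""
    sale_cnt, sale_sum = defaultdict(int), defaultdict(float)
    for product, sale in sales:
        sale_cnt[product] += 1
        sale_sum[product] += sale
    comp_cnt, comp_sum = defaultdict(int), defaultdict(float)
    for location, competitor in competition:
        comp_cnt[location] += 1
        comp_sum[location] += competitor
    count, total = 0, 0.0
    # Each inventory row pairs every matching sale with every matching
    # competitor, so counts and sums multiply per (Location, Product).
    for location, product, _ in inventory:
        count += sale_cnt[product] * comp_cnt[location]
        total += sale_sum[product] * comp_sum[location]
    return count, total

def materialized_aggregates(sales, inventory, competition):
    """Reference computation that enumerates every tuple of the join."""
    count, total = 0, 0.0
    for location, product, _ in inventory:
        for p, sale in sales:
            if p != product:
                continue
            for l, competitor in competition:
                if l == location:
                    count += 1
                    total += sale * competitor
    return count, total

factorized = factorized_aggregates(sales, inventory, competition)
materialized = materialized_aggregates(sales, inventory, competition)
```

On this toy input both functions return a count of 8 and a sum of 345.0, but the factorized version does work proportional to the size of the input relations rather than the size of the join result.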