Predictive Query Language: A Domain-Specific Language for Predictive Modeling on Relational Databases

Vid Kocijan, Jinu Sunil, Jan Eric Lenssen, Viman Deb, Xinwei He, Federico Reyes Gomez, Matthias Fey, Jure Leskovec

TL;DR

Predictive Query Language (PQL) is a SQL-inspired declarative language for defining predictive tasks directly on relational databases and automatically generating training tables. It unifies static and temporal data handling, enforces leakage-free data construction, and supports automatic task inference for regression, classification, forecasting, and link prediction. The paper presents two implementations, a batch one for Relational Deep Learning (RDL) and a low-latency one for Relational Foundation Models (RFM), and reports speedups of up to 40x along with applications in real-world domains such as recommendations, fraud detection, and healthcare. By making training-data generation concise and reproducible while remaining compatible with existing ML workflows, PQL accelerates model development and enables scalable, interactive predictive analytics on large relational datasets.

Abstract

The purpose of predictive modeling on relational data is to predict future or missing values in a relational database, for example, a user's future purchases, a patient's risk of readmission, or the likelihood that a financial transaction is fraudulent. Typically powered by machine learning methods, predictive models are used in recommendations, financial fraud detection, supply chain optimization, and other systems, providing billions of predictions every day. However, training a machine learning model requires manual work to extract the required training examples (prediction entities and target labels) from the database, which is slow, laborious, and prone to mistakes. Here, we present the Predictive Query Language (PQL), a SQL-inspired declarative language for defining predictive tasks on relational databases. PQL allows specifying a predictive task in a single declarative query, enabling the automatic computation of training labels for a large variety of machine learning tasks, such as regression, classification, time-series forecasting, and recommender systems. PQL is already successfully integrated and used in a collection of use cases as part of a predictive AI platform. The versatility of the language is demonstrated by its many ongoing use cases, including financial fraud detection, item recommendation, and workload prediction. We demonstrate its versatile design through two implementations: one for small-scale, low-latency use and one that can handle large-scale databases.

Paper Structure (17 sections, 15 figures, 1 table)

This paper contains 17 sections, 15 figures, 1 table.

Figures (15)

  • Figure 1: Predictive Query Overview. PQL allows for declaring a prediction task by formulating a query: the PREDICT statement describes what is to be predicted, and the FOR EACH and WHERE statements describe for whom the prediction is to be made. In this example, the query specifies a regression task of predicting the sum of transaction amounts over the next 30 days for all customers from New York. The predictive query engine then parses the query and generates a training table with labels from the database. The entries can be used to train a machine learning model or as examples for in-context learning in relational foundation models.
  • Figure 2: Taxonomy of the Predictive Query Language (PQL). Examples of predictive queries written in the proposed PQL. The language allows declaration of a wide range of prediction tasks, such as regression, classification, forecasting, and link prediction. Typical applications include lifetime value (LTV) prediction or forecasting, missing value imputation, churn prediction, and item recommendation for customers.
  • Figure 3: Predictive Query Example. An instance of a query to predict demand for shirts over the next three months.
  • Figure 4: Example database schema. An example retail database consisting of all customers, articles, and their timestamped transactions and notifications, serving as the basis for the running query examples. Each row contains information about the data type and semantic type, with links between keys marked with arrows. Primary keys (PK) are highlighted in blue and foreign keys (FK) in orange. If the PQL example from Figure 3 is run on this database, the training table (violet) is added to the database, with three columns: Entity, pointing to the entity of the query; Target, denoting the label (the count of the transactions); and Timestamp, denoting the anchor time from which the label was computed.
  • Figure 5: Example of a static predictive query. The query defines each TRANSACTION_ID to be an entity, and its corresponding VALUE to be the target. In this query, there is always exactly one target per entity.
  • ...and 10 more figures
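
The training-table generation described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the paper's engine: the query text in the comment, the table and column names (`customers`, `transactions`, `city`, `amount`), the toy data, the choice of anchor times, and the window boundaries (exclusive at the anchor, inclusive at the end) are all assumptions made for illustration. The sketch only shows the idea that each label is computed from rows strictly after an anchor time, which is what keeps the label out of any features computed up to that anchor.

```python
from datetime import datetime, timedelta

# Hypothetical rendering of the Figure 1 query (exact PQL syntax is an assumption):
#   PREDICT SUM(transactions.amount, 0, 30, days)
#   FOR EACH customers.customer_id
#   WHERE customers.city = 'New York'

# Toy data, invented for this example.
customers = [
    {"customer_id": 1, "city": "New York"},
    {"customer_id": 2, "city": "Boston"},
    {"customer_id": 3, "city": "New York"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0, "time": datetime(2024, 1, 5)},
    {"customer_id": 1, "amount": 15.0, "time": datetime(2024, 2, 10)},
    {"customer_id": 2, "amount": 99.0, "time": datetime(2024, 1, 7)},
    {"customer_id": 3, "amount": 40.0, "time": datetime(2024, 1, 20)},
]


def build_training_table(customers, transactions, anchor_times, horizon_days=30):
    """Build (entity, target, timestamp) rows for the assumed query.

    The target is the sum of transaction amounts in the window
    (anchor, anchor + horizon]; only rows after the anchor contribute.
    """
    horizon = timedelta(days=horizon_days)
    rows = []
    for anchor in anchor_times:
        for customer in customers:
            # WHERE filter: keep only entities from New York.
            if customer["city"] != "New York":
                continue
            # PREDICT target: aggregate over the future window only.
            target = sum(
                t["amount"]
                for t in transactions
                if t["customer_id"] == customer["customer_id"]
                and anchor < t["time"] <= anchor + horizon
            )
            rows.append(
                {"entity": customer["customer_id"], "target": target, "timestamp": anchor}
            )
    return rows


training_table = build_training_table(
    customers,
    transactions,
    anchor_times=[datetime(2024, 1, 1), datetime(2024, 1, 31)],
)
for row in training_table:
    print(row)
```

Each anchor time yields one row per matching entity, mirroring the Entity, Target, and Timestamp columns described for the training table in the Figure 4 caption. A model would then be trained with features restricted to data at or before each row's timestamp.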