SALT: Sales Autocompletion Linked Business Tables Dataset
Tassilo Klein, Clemens Biehl, Margarida Costa, Andre Sres, Jonas Kolk, Johannes Hoffart
TL;DR
This paper presents SALT, a novel dataset derived from an ERP system that links four sales tables and is designed to advance table representation learning in enterprise contexts. Joining SalesDocument, SalesDocumentItem, Customer, and AddrOrgNamePostalAddress yields a flat dataset of 2,319,944 rows with 21 input features and 8 targets, accompanied by temporally split validation and test sets and protected by strong anonymization. The authors evaluate several tabular baselines, including CARTE, AutoGluon, XGBoost, LightGBM, CatBoost, and GraphSAGE, and find that CARTE performs best by a modest margin. They also highlight data challenges such as varying predictability across targets, severe class imbalance, high-cardinality fields, and potential data drift over time. Overall, SALT demonstrates the value of realistic, linked enterprise data for table representation learning, and the authors outline plans to extend the dataset with more tables and multi-company coverage while keeping privacy at the forefront.
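To make the join and the temporal split concrete, here is a minimal, dependency-free sketch. Only the four table names come from the paper. Every field name (`document_id`, `customer_id`, `address_id`, `created_on`, and so on), the toy rows, and the cutoff date are hypothetical and do not reflect the actual SALT schema or split. It also produces a single train/test split rather than separate validation and test sets.

```python
from datetime import date

# Toy rows standing in for the four tables. All field names are hypothetical.
sales_documents = {  # SalesDocument, keyed by document id
    "D1": {"customer_id": "C1", "created_on": date(2020, 1, 5)},
    "D2": {"customer_id": "C2", "created_on": date(2020, 6, 10)},
    "D3": {"customer_id": "C1", "created_on": date(2020, 11, 20)},
}
sales_document_items = [  # SalesDocumentItem, one row per line item
    {"document_id": "D1", "item": 10, "plant": "P1"},
    {"document_id": "D1", "item": 20, "plant": "P2"},
    {"document_id": "D2", "item": 10, "plant": "P1"},
    {"document_id": "D3", "item": 10, "plant": "P3"},
]
customers = {"C1": {"address_id": "A1"}, "C2": {"address_id": "A2"}}  # Customer
addresses = {"A1": {"country": "DE"}, "A2": {"country": "US"}}  # AddrOrgNamePostalAddress


def flatten(items, documents, customers, addresses):
    """Follow the foreign keys item -> document -> customer -> address."""
    rows = []
    for item in items:
        document = documents[item["document_id"]]
        customer = customers[document["customer_id"]]
        address = addresses[customer["address_id"]]
        # One flat row per line item, carrying the fields of all four tables.
        rows.append({**item, **document, **customer, **address})
    return rows


def temporal_split(rows, cutoff):
    """Rows created before the cutoff go to train, the rest to test."""
    train = [r for r in rows if r["created_on"] < cutoff]
    test = [r for r in rows if r["created_on"] >= cutoff]
    return train, test


flat = flatten(sales_document_items, sales_documents, customers, addresses)
train, test = temporal_split(flat, cutoff=date(2020, 9, 1))
```

Splitting by time rather than at random keeps later records out of training, which is what exposes a model to the kind of data drift mentioned above.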
Abstract
Foundation models, particularly those based on Transformer architectures, have demonstrated exceptional performance in domains such as natural language processing and image processing. Adapting these models to structured data such as tables, however, introduces significant challenges. These difficulties are even more pronounced for multi-table data linked via foreign keys, which is prevalent in the enterprise realm and crucial for empowering business use cases. Despite its practical importance, research on such linked business tables in enterprise settings remains underexplored. To address this, we introduce a curated dataset sourced from an Enterprise Resource Planning (ERP) system, featuring extensive linked tables. This dataset is specifically designed to support research in table representation learning. By providing access to authentic enterprise data, we aim to improve the effectiveness and applicability of models in real-world business contexts.
