Discovering Association Rules in High-Dimensional Small Tabular Data

Erkan Karabulut, Daniel Daza, Paul Groth, Victoria Degeler

TL;DR

This work addresses association rule mining in high-dimensional, small-sample tabular data by showing that the neurosymbolic method Aerial+ scales markedly better than traditional ARM algorithms on $d \gg n$ datasets. It introduces the problem of ARM under high dimensionality with very limited data, exemplified by gene expression data, and proposes two fine-tuning strategies that leverage tabular foundation models (TabPFN) to improve rule quality in low-data regimes. The proposed methods—Aerial+WI and Aerial+DL—consistently enhance rule confidence and Zhang’s metric while reducing the total number of rules, with only modest increases in runtime. The results suggest a promising direction for integrating pretrained tabular representations into neurosymbolic ARM to achieve scalable and interpretable knowledge discovery in challenging real-world domains.

Abstract

Association Rule Mining (ARM) aims to discover patterns between features in datasets in the form of propositional rules, supporting both knowledge discovery and interpretable machine learning in high-stakes decision-making. However, in high-dimensional settings, rule explosion and computational overhead render popular algorithmic approaches impractical without effective search space reduction, challenges that propagate to downstream tasks. Neurosymbolic methods, such as Aerial+, have recently been proposed to address the rule explosion in ARM. While they tackle the high dimensionality of the data, they also inherit limitations of neural networks, particularly reduced performance in low-data regimes. This paper makes three key contributions to association rule discovery in high-dimensional tabular data. First, we empirically show that Aerial+ scales one to two orders of magnitude better than state-of-the-art algorithmic and neurosymbolic baselines across five real-world datasets. Second, we introduce the novel problem of ARM in high-dimensional, low-data settings, such as gene expression data from the biomedicine domain with around 18k features and 50 samples. Third, we propose two fine-tuning approaches to Aerial+ using tabular foundation models. Our proposed approaches are shown to significantly improve rule quality on five real-world datasets, demonstrating their effectiveness in low-data, high-dimensional scenarios.
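Rule quality in this summary is reported through confidence and Zhang's metric. As a minimal sketch of how these two measures are commonly computed for a rule A → C, the snippet below uses the standard support-based definitions. The function names and the toy `transactions` (gene-style `column=value` items) are illustrative assumptions and are not taken from the paper or from the Aerial+ code.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """conf(A -> C) = supp(A and C) / supp(A)."""
    supp_a = support(transactions, antecedent)
    if supp_a == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / supp_a


def zhang_metric(transactions, antecedent, consequent):
    """Zhang's metric in [-1, 1]; positive values indicate association."""
    supp_a = support(transactions, antecedent)
    supp_c = support(transactions, consequent)
    supp_ac = support(transactions, set(antecedent) | set(consequent))
    numerator = supp_ac - supp_a * supp_c
    denominator = max(supp_ac * (1 - supp_a), supp_a * (supp_c - supp_ac))
    return numerator / denominator if denominator else 0.0


# Hypothetical toy data: each transaction is a set of "column=value" items.
transactions = [
    {"gene_a=high", "gene_b=high"},
    {"gene_a=high", "gene_b=high"},
    {"gene_a=low", "gene_b=low"},
    {"gene_a=low", "gene_b=high"},
]

rule_conf = confidence(transactions, {"gene_a=high"}, {"gene_b=high"})
rule_zhang = zhang_metric(transactions, {"gene_a=high"}, {"gene_b=high"})
print(rule_conf, rule_zhang)
```

On this toy data the rule gene_a=high → gene_b=high has confidence 1.0 and a Zhang's metric of 0.5: the consequent always holds when the antecedent does, but it also holds in half of the remaining transactions.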

Paper Structure

This paper contains 9 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The Aerial+ [aerial_plus] ARM pipeline consists of: i) converting the given categorical tabular data into transactions by one-hot encoding, ii) vectorizing the one-hot encoded data, iii) training an under-complete denoising autoencoder with a reconstruction loss to output probability distributions per column, and iv) extracting association rules by exploiting the reconstruction ability of autoencoders, based on given probabilistic antecedent and consequent similarity thresholds.
  • Figure 2: Boxology [van2021modular] diagram of neurosymbolic ARM approaches such as Aerial+: i) a neural model of data (i.e., tabular data) is learned, ii) an algorithm (symbolic) infers rules (symbols) from the model using hypotheses (symbols, as in test vectors of Aerial+).
  • Figure 3: Scalability on high-dimensional tabular data. Execution times in seconds, on a logarithmic scale, of algorithmic and neurosymbolic ARM approaches (the latter including training and rule extraction time) as the number of columns increases gradually. Aerial+ scales one to two orders of magnitude better on high-dimensional datasets than the other methods. With a small number of columns, Aerial+ is slower because of its training overhead, so algorithmic methods remain faster on lower-dimensional tables.
  • Figure 4: Weight initialization (Aerial+WI, Left): tabular data is embedded using a foundation model, then a projection encoder is trained to align these embeddings with pre-processed Aerial+ input. The learned projection encoder is used to initialize the first-layer weights and biases of Aerial+'s encoder, providing a semantically meaningful starting point for fine-tuning. Double loss (Aerial+DL, Right): tabular embeddings are aligned with reconstructed Aerial+ outputs using a projection encoder, and this alignment objective is incorporated into the Aerial+ autoencoder reconstruction loss. This double loss encourages the autoencoder to produce reconstructions semantically consistent with the original table embeddings, supporting accurate fine-tuning.
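
Steps i) and ii) of the Figure 1 pipeline (one-hot encoding a categorical table and turning it into vectors) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `one_hot_encode`, the column names, and the toy rows are hypothetical and do not come from the Aerial+ implementation, which may encode data differently.

```python
def one_hot_encode(rows, columns):
    """One-hot encode categorical rows into fixed-length 0/1 vectors.

    Returns the list of "column=value" feature names and one vector per row.
    """
    # Sort the observed categories of each column so the layout is deterministic.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    feature_names = [f"{col}={val}" for col in columns for val in categories[col]]
    vectors = [
        [1.0 if row[col] == val else 0.0 for col in columns for val in categories[col]]
        for row in rows
    ]
    return feature_names, vectors


# Hypothetical toy table with two categorical columns.
rows = [
    {"gene_a": "high", "gene_b": "high"},
    {"gene_a": "low", "gene_b": "high"},
    {"gene_a": "low", "gene_b": "low"},
]

feature_names, vectors = one_hot_encode(rows, ["gene_a", "gene_b"])
for vector in vectors:
    print(dict(zip(feature_names, vector)))
```

Each row yields exactly one active entry per original column, so every vector sums to the number of columns. With around 18k columns, as in the gene expression setting mentioned in the abstract, the encoded width grows with the total number of categories across all columns, which is what makes the $d \gg n$ regime demanding.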