New probabilistic interest measures for association rules
Michael Hahsler, Kurt Hornik
TL;DR
This paper addresses the problem that commonly used association-rule measures like confidence and lift can be biased by item frequencies and by random co-occurrences. It introduces a simple probabilistic framework based on independent Poisson processes to model transaction data as a null model, and from this derives two new measures, hyper-lift and hyper-confidence, which compare observed co-occurrences against appropriate hyper-geometric expectations or quantiles. Empirical results on real market-basket data (Grocery) and three additional datasets (T10I4D100K, Kosarak) show that hyper-lift and hyper-confidence more effectively filter or rank rules, reducing spurious findings compared with lift and confidence, and aligning with a Fisher-one-sided-test interpretation. The work provides a principled, statistically grounded approach to rule evaluation, with practical implications for more robust rule mining and a pathway to more sophisticated domain-specific null models.
Abstract
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start with presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left hand side of rules and that lift performs poorly to filter random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
