Fast Redescription Mining Using Locality-Sensitive Hashing

Maiju Karjalainen; Esther Galbrun; Pauli Miettinen

TL;DR

This paper presents new algorithms that perform the matching and extension phases of redescription mining orders of magnitude faster than existing approaches. The algorithms are based on locality-sensitive hashing, with a tailored approach to handling the discretisation of numerical attributes.

Abstract

Redescription mining is a data analysis technique that has found applications in diverse fields. The most widely used redescription mining approaches involve two phases: finding matching pairs among data attributes and extending those pairs. This process is relatively efficient when the number of attributes is limited and the attributes are Boolean, but becomes almost intractable when the data consist of many numerical attributes. In this paper, we present new algorithms that perform the matching and extension phases orders of magnitude faster than existing approaches. Our algorithms are based on locality-sensitive hashing, with a tailored approach to handling the discretisation of numerical attributes as used in redescription mining.
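
To make the matching phase more concrete, the following is a minimal sketch of generic MinHash-based locality-sensitive hashing applied to Boolean attributes. It is not the paper's algorithm: it assumes that each attribute is represented by its support (the set of row indices where it holds), that the similarity of a pair is the Jaccard similarity of the two supports, and it ignores numerical attributes and their discretisation entirely. The names `minhash_signature`, `candidate_pairs`, `n_bands`, and `rows_per_band` are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict


def minhash_signature(support, hash_params, prime):
    """MinHash signature of a non-empty set of row indices."""
    return tuple(
        min((a * row + b) % prime for row in support)
        for a, b in hash_params
    )


def candidate_pairs(left, right, n_bands=10, rows_per_band=4, seed=0):
    """Return (left_name, right_name) pairs that share at least one LSH bucket.

    `left` and `right` map attribute names to supports (sets of row indices),
    one dictionary per side of the data. Illustrative sketch only.
    """
    rng = random.Random(seed)
    prime = 2_147_483_647  # 2^31 - 1; assumes row indices are smaller than this
    n_hashes = n_bands * rows_per_band
    hash_params = [
        (rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n_hashes)
    ]

    # Each bucket keeps the left-side and right-side attribute names apart.
    buckets = defaultdict(lambda: (set(), set()))
    for side, attributes in enumerate((left, right)):
        for name, support in attributes.items():
            if not support:
                continue  # an empty support has no MinHash signature
            sig = minhash_signature(support, hash_params, prime)
            for band in range(n_bands):
                chunk = sig[band * rows_per_band:(band + 1) * rows_per_band]
                buckets[(band, chunk)][side].add(name)

    # Only attributes that collide in some band are compared further.
    pairs = set()
    for left_names, right_names in buckets.values():
        pairs.update((l, r) for l in left_names for r in right_names)
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets, at least one of them non-empty."""
    return len(a & b) / len(a | b)


left = {"a1": {0, 1, 2, 3, 4, 5, 6, 7}, "a2": {20, 21, 22}}
right = {"b1": {0, 1, 2, 3, 4, 5, 6, 7}, "b2": {40, 41}}
pairs = candidate_pairs(left, right)
print(pairs)  # {('a1', 'b1')}
```

In this toy input, the identical supports of `a1` and `b1` yield identical signatures and therefore always collide, while disjoint supports can never share a minimum hash value. Candidate pairs would then be verified with `jaccard`, so the exact similarity is computed only for attributes that share a bucket rather than for every left-right combination.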

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 7 tables, and 3 algorithms.

Introduction
Redescription mining.
Related work.
The Algorithm
The ReReMi algorithm
Primer on LSH
Finding Initial Pairs
Extending Initial Pairs
Computing signatures for literals.
The target vector.
Extending redescriptions.
Time Complexity
Experimental Evaluation
Experimental Setup
Finding Initial Pairs
...and 12 more sections

Figures (4)

Figure 1: Left: Running times on the DentalW dataset for finding initial pairs (blue) and extending pairs (yellow) using (A) the proposed algorithm (Fier), (B) Fier for initial pairs and ReReMi for extensions, (C) ReReMi with pre-bucketing (ReReMiBkt), and (D) standard ReReMi. The number within each bar indicates how many initial pairs were found. Right: Example redescription.
Figure 2: Comparing the accuracy of pairs found by ReReMiBkt and ReReMi. Each dot represents a pair of columns, and its location indicates the highest-accuracy initial pair ReReMiBkt and ReReMi found.
Figure 3: Comparing the accuracy of pairs found by Fier and ReReMiBkt. Each dot represents a pair of columns, and its location indicates the highest-accuracy initial pair Fier and ReReMiBkt found. The light blue line at the bottom shows the density of the dots that lie along the $x$-axis. All axes range across the unit interval.
Figure 4: Comparing the accuracy of once-extended initial pairs by Fier and ReReMi. Each dot represents a pair of columns, and its location indicates the highest-accuracy initial pair Fier and ReReMi found. All axes range across the unit interval.
