Fast Redescription Mining Using Locality-Sensitive Hashing

Maiju Karjalainen, Esther Galbrun, Pauli Miettinen

TL;DR

This paper presents new algorithms that perform the matching and extension phases of redescription mining orders of magnitude faster than existing approaches. The algorithms are based on locality-sensitive hashing, with a tailored approach to handling the discretisation of numerical attributes.

Abstract

Redescription mining is a data analysis technique that has found applications in diverse fields. The most widely used redescription mining approaches involve two phases: finding matching pairs among data attributes and extending the pairs. This process is relatively efficient when the number of attributes remains limited and when the attributes are Boolean, but becomes almost intractable when the data consist of many numerical attributes. In this paper, we present new algorithms that perform the matching and extension orders of magnitude faster than the existing approaches. Our algorithms are based on locality-sensitive hashing with a tailored approach to handle the discretisation of numerical attributes as used in redescription mining.
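To make the matching phase more concrete, here is a minimal sketch of how locality-sensitive hashing can shortlist candidate attribute pairs in the simplest case, Boolean attributes. It uses textbook MinHash with banding, relying on the fact that redescription accuracy is commonly measured as the Jaccard similarity of the supports. It is not the paper's algorithm and does not address numerical attributes or their discretisation. All names (`candidate_pairs`, `num_hashes`, `band_size`) and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime used as modulus for the hash family


def jaccard(a, b):
    """Jaccard similarity of two sets of row indices (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash_signature(support, hash_params):
    """MinHash signature of a non-empty set of row indices."""
    return tuple(
        min((a * row + b) % PRIME for row in support) for a, b in hash_params
    )


def candidate_pairs(left, right, num_hashes=60, band_size=3, seed=0):
    """Return (left_name, right_name) pairs that share at least one LSH bucket.

    `left` and `right` map attribute names to their supports, i.e. the sets
    of row indices where the Boolean attribute is true (hypothetical layout).
    """
    rng = random.Random(seed)
    hash_params = [
        (rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)
    ]
    # Each bucket keeps the attribute names from each side separately.
    buckets = defaultdict(lambda: (set(), set()))
    for side, view in enumerate((left, right)):
        for name, support in view.items():
            if not support:
                continue  # an empty support cannot match anything
            sig = minhash_signature(support, hash_params)
            for start in range(0, num_hashes, band_size):
                key = (start, sig[start:start + band_size])
                buckets[key][side].add(name)
    pairs = set()
    for lefts, rights in buckets.values():
        pairs.update((l, r) for l in lefts for r in rights)
    return pairs


# Toy data: supports of Boolean attributes on the two sides.
left = {"a": {1, 2, 3, 4}, "b": {10, 11}}
right = {"x": {1, 2, 3, 4}, "y": {100, 200}}

pairs = candidate_pairs(left, right)
# Only the shortlisted candidates need an exact accuracy computation.
scored = {(l, r): jaccard(left[l], right[r]) for l, r in pairs}
```

Attributes with identical supports always land in the same buckets, while attributes with disjoint supports never do under this hash family. Pairs in between are shortlisted with a probability that grows with their Jaccard similarity, so the exact comparison runs on far fewer pairs than the full cross product.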

Paper Structure

This paper contains 27 sections, 1 equation, 4 figures, 7 tables, and 3 algorithms.

Figures (4)

  • Figure 1: Left: Running times on the DentalW dataset for finding initial pairs (blue) and extending pairs (yellow) using (A) the proposed algorithm (Fier), (B) Fier for initial pairs and ReReMi for extensions, (C) ReReMi with pre-bucketing (ReReMiBkt) and (D) standard ReReMi. The number within each bar indicates how many initial pairs were found. Right: Example redescription.
  • Figure 2: Comparing the accuracy of pairs found by ReReMiBkt and ReReMi. Each dot represents a pair of columns, and its location indicates the highest-accuracy initial pair ReReMiBkt and ReReMi found.
  • Figure 3: Comparing the accuracy of pairs found by Fier and ReReMiBkt. Each dot represents a pair of columns, and its location indicates the highest-accuracy initial pair Fier and ReReMiBkt found. The light blue line at the bottom shows the density of the dots that lie along the $x$-axis. All axes range across the unit interval.
  • Figure 4: Comparing the accuracy of once extended initial pairs by Fier and ReReMi. Each dot represents a pair of columns, and its location indicates the highest-accuracy initial pair Fier and ReReMi found. All axes range across the unit interval.