Accelerating spherical K-means clustering for large-scale sparse document data

Kazuo Aoyama, Kazumi Saito

TL;DR

An algorithm is designed to work in an architecture-friendly manner (AFM), that is, to suppress performance-degradation factors such as the numbers of instructions, branch mispredictions, and cache misses in the CPUs of a computer system.

Abstract

This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets. We design an algorithm that works in an architecture-friendly manner (AFM), that is, one that suppresses performance-degradation factors such as the numbers of instructions, branch mispredictions, and cache misses in the CPUs of a modern computer system. For the AFM operation, we leverage unique universal characteristics (UCs) of a data-object and a cluster's mean set, which are skewed distributions on data relationships such as Zipf's law and a feature-value concentration phenomenon. The UCs indicate that most of the multiplications for similarity calculations involve terms with high document frequencies (df), and that most of the similarity between an object-feature vector and a mean-feature vector comes from the multiplications involving a few high mean-feature values. Our proposed algorithm applies an inverted-index data structure to a mean set, extracts the specific region with high-df terms and high mean-feature values in the mean-inverted index using two newly introduced structural parameters, and exploits the index divided into three parts for efficient pruning. The algorithm determines the two structural parameters by minimizing the approximate number of multiplications, which is related to the number of instructions; reduces branch mispredictions by sharing the index structure, including the two parameters, with all the objects; and suppresses cache misses by keeping the frequently used data of the foregoing specific region in the caches. As a result, it works in the AFM. We experimentally demonstrate that our algorithm efficiently achieves superior speed performance on large-scale document data compared with algorithms using state-of-the-art techniques.
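To make the idea of a mean-inverted index concrete, here is a minimal Python sketch of the assignment step of spherical K-means when the inverted index is built over the mean set. It is an illustration under stated assumptions, not the authors' implementation: the function names (`normalize`, `build_mean_inverted_index`, `assign`) and the toy data are hypothetical, and the sketch omits the two structural parameters, the three-part index division, and the pruning described in the abstract.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: term ID -> value) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm > 0 else dict(vec)


def build_mean_inverted_index(means):
    """Map each term ID to a list of (cluster ID, mean-feature value) tuples."""
    index = defaultdict(list)
    for cluster_id, mean in enumerate(means):
        for term_id, value in mean.items():
            index[term_id].append((cluster_id, value))
    return index


def assign(obj, index, num_clusters):
    """Return (best cluster ID, cosine similarity) for a unit-norm sparse object."""
    sims = [0.0] * num_clusters
    # Only terms present in the object are visited, so the number of
    # multiplications depends on how many means share those terms.
    for term_id, u in obj.items():
        for cluster_id, v in index.get(term_id, ()):
            sims[cluster_id] += u * v
    best = max(range(num_clusters), key=sims.__getitem__)
    return best, sims[best]


# Toy data (hypothetical): two unit-norm means and one document.
means = [normalize({0: 1.0, 1: 1.0}), normalize({2: 1.0, 3: 2.0})]
index = build_mean_inverted_index(means)
doc = normalize({0: 3.0, 2: 1.0})
best_cluster, best_sim = assign(doc, index, len(means))
```

In this sketch every multiplication is triggered by a term in the object, so terms whose index arrays are long (many means contain them) dominate the cost. That is the kind of skew the abstract refers to when it says most multiplications involve high-df terms.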

Paper Structure

This paper contains 35 sections, 46 equations, 23 figures, 22 tables, 11 algorithms.

Figures (23)

  • Figure 1: Performance comparison of MIVI, DIVI, and Ding$^+$ in 8.2M-sized PubMed data set with $K$=80 000: (a) Number of multiplications and (b) Elapsed time along iterations.
  • Figure 2: Characteristics of 8.2M-sized PubMed data set: (a) Zipf's law on term frequency (tf) and document frequency (df) and (b) Bounded Zipf's law on mean frequency (mf) with four $K$ values.
  • Figure 3: Characteristics of 8.2M-sized PubMed data set: (a) df-$\overline{\hbox{\em mf}}$ scatter plot in log-log scale and (b) Diagram of number of multiplications when MIVI is executed, which corresponds to volume surrounded by curves inside the rectangle.
  • Figure 4: (a) Skewed form of mean-feature values and (b) cumulative partial similarity (CPS) against normalized rank. Both were built from 8.2M-sized PubMed data set and $K$=80 000 was used for (b).
  • Figure 5: Diagram of three regions illustrated on plane of term ID and mean-inverted index. Right figure represents $s$th mean-inverted-index array composed of tuples $\bm{(c_{(s,q)},v_{c_{(s,q)}})}$.
  • ...and 18 more figures