Improved Algorithms for Clustering with Noisy Distance Oracles

Pinki Pradhan, Anup Bhattacharya, Ragesh Jaiswal

TL;DR

One of the main contributions of this work is to show that the $k$-means++ algorithm can be adapted to work in the weak-strong oracle model using only a small number of strong-oracle queries, which are the critical resource in this model.

Abstract

Bateni et al. recently introduced the weak-strong distance oracle model to study clustering problems in settings with limited distance information. Given query access to a strong oracle and a weak oracle, the authors design approximation algorithms for the $k$-means and $k$-center clustering problems. In this work, we design algorithms with improved guarantees for both problems in the weak-strong oracle model. The $k$-means++ algorithm is routinely used to solve $k$-means in settings where complete distance information is available. One of the main contributions of this work is to show that the $k$-means++ algorithm can be adapted to work in the weak-strong oracle model using only a small number of strong-oracle queries, which are the critical resource in this model. In particular, our $k$-means++ based algorithm gives a constant approximation for $k$-means and uses $O(k^2 \log^2 n)$ strong-oracle queries. This improves on the algorithm of Bateni et al., which uses $O(k^2 \log^4 n \log^2 \log n)$ strong-oracle queries for a constant-factor approximation of $k$-means. For the $k$-center problem, we give a simple ball-carving based $6(1 + \epsilon)$-approximation algorithm that uses $O(k^3 \log^2 n \log \frac{\log n}{\epsilon})$ strong-oracle queries. This improves on the $14(1 + \epsilon)$-approximation algorithm of Bateni et al., which uses $O(k^2 \log^4 n \log^2 \frac{\log n}{\epsilon})$ strong-oracle queries. To show the effectiveness of our algorithms, we perform empirical evaluations on real-world datasets and show that our algorithms significantly outperform the algorithms of Bateni et al.
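To get a feel for the strong-oracle query bounds quoted in the abstract, the following is a minimal sketch that evaluates them numerically. It is not part of the paper: the function names are hypothetical, all hidden constants in the $O(\cdot)$ notation are assumed to be 1, and logarithms are assumed to be base 2, so the outputs indicate orders of growth only, not actual query counts.

```python
import math


def kmeans_queries_this_work(n: int, k: int) -> float:
    """k^2 * log^2 n, with the hidden constant assumed to be 1."""
    return k**2 * math.log2(n) ** 2


def kmeans_queries_bateni(n: int, k: int) -> float:
    """k^2 * log^4 n * (log log n)^2, with the hidden constant assumed to be 1."""
    return k**2 * math.log2(n) ** 4 * math.log2(math.log2(n)) ** 2


def kcenter_queries_this_work(n: int, k: int, eps: float) -> float:
    """k^3 * log^2 n * log(log n / eps), with the hidden constant assumed to be 1."""
    return k**3 * math.log2(n) ** 2 * math.log2(math.log2(n) / eps)


def kcenter_queries_bateni(n: int, k: int, eps: float) -> float:
    """k^2 * log^4 n * log^2(log n / eps), with the hidden constant assumed to be 1."""
    return k**2 * math.log2(n) ** 4 * math.log2(math.log2(n) / eps) ** 2


def kmeans_ratio(n: int, k: int) -> float:
    """Ratio of the earlier k-means bound to the new one (k cancels out)."""
    return kmeans_queries_bateni(n, k) / kmeans_queries_this_work(n, k)


def kcenter_ratio(n: int, k: int, eps: float) -> float:
    """Ratio of the earlier k-center bound to the new one (shrinks as k grows)."""
    return kcenter_queries_bateni(n, k, eps) / kcenter_queries_this_work(n, k, eps)


if __name__ == "__main__":
    # Illustrative values only; n, k, and eps are not taken from the paper.
    n, k, eps = 2**16, 10, 0.5
    print("k-means ratio:", kmeans_ratio(n, k))
    print("k-center ratio:", kcenter_ratio(n, k, eps))
```

Under these assumptions the $k$-means ratio simplifies to $\log^2 n \cdot \log^2 \log n$, independent of $k$. The $k$-center ratio simplifies to $\log^2 n \cdot \log\frac{\log n}{\epsilon} / k$, so in terms of the stated bounds the query saving for $k$-center depends on how $k$ compares with the polylogarithmic factors, while the approximation factor improves from $14(1+\epsilon)$ to $6(1+\epsilon)$ regardless.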

Paper Structure

This paper contains 39 sections, 28 theorems, 12 equations, 10 figures, 8 tables, and 5 algorithms.

Key Result

Theorem 1.1

(Upper bound for $k$-means) Let $\epsilon \in (0,1)$ and $\delta \leq 1/3$. There exists a randomized algorithm for the $k$-means problem that adapts oversampling $k$-means++ in the weak–strong oracle model and outputs a $(O(\frac{ \log n}{\epsilon^4}), 40(1+\epsilon))$ bi-criteria approximation for

Figures (10)

  • Figure 1: $k$-means on SBM dataset with $n = 10{,}000$
  • Figure 2: $k$-means on SBM dataset with $n = 20{,}000$
  • Figure 3: $k$-means on SBM dataset with $n = 50{,}000$
  • Figure 4: $k$-means on MNIST with SVD embedding, $n = 60{,}000$
  • Figure 5: $k$-means on MNIST with t-SNE embedding, $n = 60{,}000$
  • ...and 5 more figures

Theorems & Definitions (59)

  • Definition 1.1: Strong-oracle model
  • Definition 1.2: Weak-oracle model
  • Theorem 1.1
  • Theorem 1.2
  • Remark 1.1
  • Theorem 1.3
  • Theorem 2.1
  • Definition 2.1
  • Lemma 2.1
  • Definition 2.2
  • ...and 49 more