Improved Algorithms for Clustering with Noisy Distance Oracles

Pinki Pradhan; Anup Bhattacharya; Ragesh Jaiswal

TL;DR

The main contribution of this work is to show that the $k$-means++ algorithm can be adapted to work in the weak-strong oracle model using only a small number of strong-oracle queries, which are the critical resource in this model.

Abstract

Bateni et al. have recently introduced the weak-strong distance oracle model to study clustering problems in settings with limited distance information. Given query access to the strong-oracle and weak-oracle in the weak-strong oracle model, the authors design approximation algorithms for $k$-means and $k$-center clustering problems. In this work, we design algorithms with improved guarantees for $k$-means and $k$-center clustering problems in the weak-strong oracle model. The $k$-means++ algorithm is routinely used to solve $k$-means in settings where complete distance information is available. One of the main contributions of this work is to show that the $k$-means++ algorithm can be adapted to work in the weak-strong oracle model using only a small number of strong-oracle queries, which are the critical resource in this model. In particular, our $k$-means++ based algorithm gives a constant approximation for $k$-means and uses $O(k^2 \log^2{n})$ strong-oracle queries. This improves on the algorithm of Bateni et al. that uses $O(k^2 \log^4{n} \log^2 \log{n})$ strong-oracle queries for a constant factor approximation of $k$-means. For the $k$-center problem, we give a simple ball-carving based $6(1 + \epsilon)$-approximation algorithm that uses $O(k^3 \log^2{n} \log{\frac{\log{n}}{\epsilon}})$ strong-oracle queries. This is an improvement over the $14(1 + \epsilon)$-approximation algorithm of Bateni et al. that uses $O(k^2 \log^4{n} \log^2{\frac{\log{n}}{\epsilon}})$ strong-oracle queries. To show the effectiveness of our algorithms, we perform empirical evaluations on real-world datasets and show that our algorithms significantly outperform the algorithms of Bateni et al.
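To make the role of strong-oracle queries concrete, the following is a minimal sketch of the classical $k$-means++ seeding (D² sampling) in the full-information setting, where every distance is read from an exact oracle and counted. It is not the adapted algorithm of this paper: the weak oracle is not modeled at all, and the names `CountingOracle` and `kmeanspp_seeding` are hypothetical, introduced only for illustration. The sketch shows that this plain seeding issues $n(k-1)$ exact distance queries, a count that grows linearly in $n$, whereas the bound quoted above, $O(k^2 \log^2{n})$, is polylogarithmic in $n$.

```python
import random


class CountingOracle:
    """Hypothetical stand-in for a strong oracle: exact distances, with a query counter."""

    def __init__(self, points):
        self.points = points
        self.queries = 0

    def dist(self, i, j):
        # Each call counts as one strong-oracle query.
        self.queries += 1
        return sum((a - b) ** 2 for a, b in zip(self.points[i], self.points[j])) ** 0.5


def kmeanspp_seeding(oracle, n, k, rng):
    """Classical D^2 sampling; returns the indices of k chosen centers."""
    centers = [rng.randrange(n)]
    # Squared distance from every point to its nearest chosen center.
    d2 = [oracle.dist(i, centers[0]) ** 2 for i in range(n)]
    while len(centers) < k:
        if sum(d2) == 0:
            # Degenerate case: every point coincides with a chosen center.
            nxt = next(i for i in range(n) if i not in centers)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(range(n), weights=d2, k=1)[0]
        centers.append(nxt)
        if len(centers) < k:
            # Refresh nearest-center distances against the new center.
            for i in range(n):
                d2[i] = min(d2[i], oracle.dist(i, nxt) ** 2)
    return centers


points = [(float(i), float(i % 3)) for i in range(50)]
oracle = CountingOracle(points)
centers = kmeanspp_seeding(oracle, len(points), 4, random.Random(0))
print(centers, oracle.queries)
```

With $n = 50$ and $k = 4$ the counter ends at $50 \cdot 3 = 150$ queries, regardless of the random seed, because each of the first $k-1$ centers triggers one pass over all points.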

Paper Structure (39 sections, 28 theorems, 12 equations, 10 figures, 8 tables, 5 algorithms)

INTRODUCTION
Clustering Problems
Main Results
Weak-strong oracle model:
Weak-oracle model:
Our contributions and comparisons with known works
Experiments
Technical Overview
Related Works
ALGORITHM FOR $k$-MEANS IN WEAK-STRONG ORACLE MODEL
Estimating $d^{est}_{km}(x,C)$ in weak-strong oracle model
ALGORITHM FOR $k$-CENTER IN WEAK-STRONG ORACLE MODEL
EXPERIMENTAL RESULTS
Datasets
Construction of the weak-oracle
...and 24 more sections

Key Result

Theorem 1.1

(Upper bound for $k$-means) Let $\epsilon \in (0,1)$ and $\delta \leq 1/3$. There exists a randomized algorithm for the $k$-means problem that adapts oversampling $k$-means++ in the weak-strong oracle model and outputs a $(O(\frac{\log n}{\epsilon^4}), 40(1+\epsilon))$ bi-criteria approximation for

Figures (10)

Figure 1: $k$-means on SBM dataset with $n=10k$
Figure 2: $k$-means on SBM dataset with $n=20k$
Figure 3: $k$-means on SBM dataset with $n=50k$
Figure 4: $k$-means on MNIST with SVD embedding $n = 60k$
Figure 5: $k$-means on MNIST with t-SNE embedding $n = 60k$
...and 5 more figures

Theorems & Definitions (59)

Definition 1.1: Strong-oracle model
Definition 1.2: Weak-oracle model
Theorem 1.1
Theorem 1.2
Remark 1.1
Theorem 1.3
Theorem 2.1
Definition 2.1
Lemma 2.1
Definition 2.2
...and 49 more
