Binary $k$-Center with Missing Entries: Structure Leads to Tractability

Farehe Soheil, Kirill Simonov, Tobias Friedrich

TL;DR

This paper studies $k$-Center with Missing Entries for binary data, where distances ignore unknown coordinates and the structure of known entries is captured by the incidence graph $G_{\bm{M}}$. It establishes fixed-parameter tractability with respect to several natural structural parameters of $G_{\bm{M}}$ (vertex cover, fracture number, and treewidth) and connects Closest String to ILP Feasibility, showing that improvements in one problem would translate to improvements in ILP solving. The authors provide concrete FPT algorithms: (i) by vertex cover, (ii) by treewidth (with radius parameter $d$), and (iii) by fracture number, each with explicit exponential dependencies on the respective parameter and polynomial in the input size. These results reveal that sparsity and structural constraints on missing data can dramatically reduce the hardness of clustering with missing entries and tie clustering to core ILP techniques, suggesting practical algorithmic avenues and deep theoretical connections. The work also situates Closest String and ILP within a tight equivalence framework, indicating that progress in fundamental covering/ILP problems would settle longstanding questions in related domains.

Abstract

$k$-Center clustering is a fundamental classification problem, where the task is to categorize the given collection of entities into $k$ clusters and come up with a representative for each cluster, so that the maximum distance between an entity and its representative is minimized. In this work, we focus on the setting where the entities are represented by binary vectors with missing entries, which model incomplete categorical data. This version of the problem has wide applications, from predictive analytics to bioinformatics. Our main finding is that the problem, which is notoriously hard from the classical complexity viewpoint, becomes tractable as soon as the known entries are sparse and exhibit a certain structure. Formally, we show fixed-parameter tractable algorithms for the parameters vertex cover, fracture number, and treewidth of the row-column graph, which encodes the positions of the known entries of the matrix. Additionally, we tie the complexity of the 1-cluster variant of the problem, which is famous under the name Closest String, to the complexity of solving integer linear programs with few constraints. This implies, in particular, that improving upon the running times of our algorithms would lead to more efficient algorithms for integer linear programming in general.
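To make the objective concrete, here is a minimal Python sketch of the distance and the clustering radius. It assumes that a missing entry is encoded as `None` (a hypothetical convention, not taken from the paper), that centers are complete binary vectors, and that a coordinate contributes to the distance only when the entry of the data row is known. The exhaustive search is purely illustrative and is not one of the paper's FPT algorithms.

```python
from itertools import product

MISSING = None  # hypothetical marker for an unknown entry


def masked_distance(row, center):
    """Count coordinates where the row entry is known and differs from the center."""
    return sum(1 for a, c in zip(row, center) if a is not MISSING and a != c)


def clustering_radius(rows, centers):
    """Maximum, over all rows, of the distance to the nearest center."""
    return max(min(masked_distance(r, c) for c in centers) for r in rows)


def brute_force_one_center(rows):
    """Try all 2^m binary centers; feasible only for tiny illustrative inputs."""
    m = len(rows[0])
    return min(product((0, 1), repeat=m),
               key=lambda c: clustering_radius(rows, [c]))


rows = [
    (0, 0, MISSING, 1),
    (0, MISSING, 1, 1),
    (1, 1, 1, MISSING),
]
best = brute_force_one_center(rows)
print(best, clustering_radius(rows, [best]))
```

With a single center this is the Closest String setting mentioned above; passing several centers to `clustering_radius` evaluates a candidate solution for general $k$.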

Paper Structure

This paper contains 8 sections, 19 theorems, 56 equations, 7 figures, 1 algorithm.

Key Result

Theorem 1.1

$k$-Center with Missing Entries admits an algorithm with running time

Figures (7)

  • Figure 1: On the left, the mask matrix $\bm{M}$ and on the right, its corresponding incidence graph. The row vertices, $R_{\bm{M}}$, are in gray and the column vertices $C_{\bm{M}}$ are in black.
  • Figure 2: On the left is matrix $\bm{M}$, with each row corresponding to a string in $S$. Columns $c_1$, $c_4$, and $c_7$ are represented in bold. The reordered input matrix $\bm{M}'$ is on the right. Strings of the same type are grouped as blocks, with each block colored in gray. In each block, the rows are either consecutive ones or consecutive zeros. The first and second row restricted to the block of type $t_1$, represented in bold font, are "000" and "111" respectively.
  • Figure 3: The target instance $(\bm{A}', \bm{b}')$ in the construction of Lemma \ref{lemma:binary_to_plus_minus_one}.
  • Figure 4: Construction of strings in the Closest String instance, Lemma \ref{lemma:closest_string}.
  • Figure 5: The mask matrix $\bm{M}$. The non-zero entries are distributed only within $R_{S}$ and $C_{S}$. For every $r \in \overline{R_{S}}$ and $c \in \overline{C_{S}}$ it holds that $\bm{M}[r][c]=0$.
  • ...and 2 more figures
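
The correspondence between a mask matrix and its incidence graph described for Figure 1 can be sketched in a few lines of Python. The sketch assumes that `mask[i][j] == 1` marks a known entry and that each such entry becomes an edge between row vertex `("r", i)` and column vertex `("c", j)`; the vertex labels and the helper names are hypothetical and chosen only for illustration.

```python
def incidence_graph(mask):
    """Edge list of the bipartite row-column graph of a 0/1 mask matrix."""
    return [(("r", i), ("c", j))
            for i, row in enumerate(mask)
            for j, known in enumerate(row)
            if known]


def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in edges)


mask = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]
edges = incidence_graph(mask)
print(len(edges))
print(is_vertex_cover(edges, {("r", 0), ("c", 1), ("r", 2)}))
```

A small vertex cover of this graph means that all known entries lie in a few rows and columns, which is the kind of structure the vertex cover parameterization exploits.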

Theorems & Definitions (36)

  • Theorem 1.1
  • Theorem 1.2
  • Theorem 1.3
  • Theorem 1.4
  • Definition 2.1: $k$-Center with Missing Entries
  • Definition 2.2: Closest String
  • Definition 2.3: Non-uniform Closest String
  • Definition 2.4: ILP Feasibility
  • Theorem 3.1
  • proof
  • ...and 26 more