Maximizing Diversity in (near-)Median String Selection

Diptarka Chakraborty; Rudrayan Kundu; Nidhi Purohit; Aravinda Kanchana Ruwanpathirana

TL;DR

This work initiates a systematic study of generating diverse near-optimal Hamming medians. It gives an exact algorithm for the diameter variant, which returns two maximally diverse (near-)medians, and develops approximation algorithms for sum dispersion, including a PTAS, as well as bi-criteria and LP-based approaches for min dispersion that produce multiple diverse near-medians. The results exploit the structure of the Hamming median space and draw on constructions from error-correcting codes. Potential applications include motif discovery, prototype design, and robust decision-making where solution diversity matters.

Abstract

Given a set of strings over a specified alphabet, identifying a median or consensus string that minimizes the total distance to all input strings is a fundamental data aggregation problem. When the Hamming distance is considered as the underlying metric, this problem has extensive applications, ranging from bioinformatics to pattern recognition. However, modern applications often require the generation of multiple (near-)optimal yet diverse median strings to enhance flexibility and robustness in decision-making. In this study, we address this need by focusing on two prominent diversity measures: sum dispersion and min dispersion. We first introduce an exact algorithm for the diameter variant of the problem, which identifies pairs of near-optimal medians that are maximally diverse. Subsequently, we propose a $(1-ε)$-approximation algorithm (for any $ε>0$) for sum dispersion, as well as a bi-criteria approximation algorithm for the more challenging min dispersion case, allowing the generation of multiple (more than two) diverse near-optimal Hamming medians. Our approach primarily leverages structural insights into the Hamming median space and also draws on techniques from error-correcting code construction to establish these results.
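To make the objects in the abstract concrete, here is a minimal Python sketch of the Hamming median objective and the two diversity measures. It relies on the standard fact that the Hamming cost decomposes by position, so picking a most frequent character in each column yields a median; ties in a column are what give rise to several distinct medians. The function names are illustrative and not taken from the paper, and the dispersion helpers assume the usual definitions (sum and minimum of pairwise distances), which may differ in normalization from the paper's Definitions.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def total_cost(candidate, strings):
    """Median objective: total Hamming distance from candidate to all inputs."""
    return sum(hamming(candidate, s) for s in strings)


def plurality_median(strings):
    """One Hamming median: a most frequent character in each column.

    Ties are broken by first occurrence; other tie-breaks give other medians.
    """
    d = len(strings[0])
    return "".join(
        Counter(s[i] for s in strings).most_common(1)[0][0] for i in range(d)
    )


def sum_dispersion(selection):
    """Sum of pairwise Hamming distances (assumed definition)."""
    return sum(hamming(a, b) for a, b in combinations(selection, 2))


def min_dispersion(selection):
    """Minimum pairwise Hamming distance (assumed definition)."""
    return min(hamming(a, b) for a, b in combinations(selection, 2))
```

For the inputs `["ACGT", "ACGA", "TCGT"]`, the column-wise rule picks `A`, `C`, `G`, `T`, and the resulting string has total cost 2.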

Paper Structure (15 sections, 35 theorems, 6 equations, 1 figure, 4 algorithms)

Introduction
Preliminaries
Exact Algorithms for Diameter Maximization
Maximizing the Sum Dispersion
Maximizing the Minimum Dispersion
Discussion and Future Work
Missing Proofs from Preliminaries
Exact Algorithm for Diameter Maximization: Median Strings
Exact Algorithm for Diameter Maximization: (1+ϵ)-Approximate Medians
Finding a Partition with Smallest Sum Difference
Exact Algorithm for Sum Dispersion: k Hamming medians
A PTAS for Sum Dispersion: k Approximate Hamming medians
Approximation Algorithm for Min Dispersion: k Hamming Medians
Bi-criteria Approximation for Min Dispersion: k Approximate Hamming Medians
Generalized Plotkin Bound

Key Result

Theorem 1

Consider an alphabet $\Gamma$. There exists an algorithm that, given any $X \subseteq \Gamma^d$ of size $n$ and $\varepsilon > 0$, outputs two $(1+\varepsilon)$-approximate Hamming medians of $X$ with maximum diameter, and runs in time $O((1+\varepsilon)nd + d\log d)$.
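The objective in Theorem 1 can be illustrated with a brute-force check on tiny instances. The sketch below is not the paper's algorithm: it enumerates all of $\Gamma^d$, which takes exponential time, whereas the theorem states a near-linear bound. It also assumes that a $(1+\varepsilon)$-approximate median is a string whose total Hamming cost is at most $(1+\varepsilon)$ times the optimum. All function names are hypothetical.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def total_cost(candidate, strings):
    """Total Hamming distance from candidate to all input strings."""
    return sum(hamming(candidate, s) for s in strings)


def near_medians(strings, alphabet, eps):
    """All strings with cost <= (1 + eps) * optimum (assumed definition)."""
    d = len(strings[0])
    candidates = ["".join(p) for p in product(alphabet, repeat=d)]
    costs = {c: total_cost(c, strings) for c in candidates}
    opt = min(costs.values())
    return [c for c in candidates if costs[c] <= (1 + eps) * opt]


def max_diameter_pair(strings, alphabet, eps):
    """Brute force: the pair of near-medians at maximum Hamming distance."""
    pool = near_medians(strings, alphabet, eps)
    if len(pool) < 2:
        # Only one near-median exists, so the diameter is 0.
        return pool[0], pool[0]
    return max(combinations(pool, 2), key=lambda pair: hamming(*pair))


pair = max_diameter_pair(["AA", "AB", "BA", "BB"], "AB", eps=0.0)
print(pair, hamming(*pair))
```

In this example every string in $\{A,B\}^2$ has total cost 4, so all four are exact medians and the best pair is at distance 2. A brute force like this is only useful as a correctness reference for small $d$ and small alphabets.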

Figures (1)

Figure 1: Let $\Gamma_i = \{a_1, \ldots, a_r\}$ be the set of most frequent characters at index $i$. Overview of the characters at index $i$ after the modifications made by the Sum-Dispersion-Exact algorithm.

Theorems & Definitions (38)

Theorem 1
Theorem 2
Theorem 3: Informal Statement
Theorem 4: Informal Statement
Lemma 5: Folklore
Lemma 6
Definition 7: Min Dispersion
Definition 8: Sum Dispersion
Theorem 8
Theorem 9
...and 28 more
