A Three-Stage Algorithm for the Closest String Problem on Artificial and Real Gene Sequences

Alireza Abdi, Marko Djukanovic, Hesam Tahmasebi Boldaji, Hadis Salehi, Aleksandar Kartelj

TL;DR

The paper tackles the Closest String Problem (CSP), an NP-hard task with applications in coding theory and computational biology, by proposing the Three-Stage Algorithm (TSA). TSA combines (i) an alphabet pruning scheme based on Rank 1/Rank 2 frequencies that reduces the search space, (ii) a Time-Restricted Beam Search (TRBS) guided by an Expected Distance Heuristic (EX) that steers the search toward promising regions, and (iii) a targeted local search that refines the final solution. The authors evaluate TSA against ILP, GWSA, and WFC on five benchmark sets, including artificial DNA/protein data and real-world TP53 and flu datasets, and report that TSA achieves state-of-the-art solution quality in most cases and favorable runtimes on larger instances. They also provide a statistical analysis supporting the robustness and significance of TSA’s improvements, and discuss future directions such as Monte Carlo Tree Search and learning-based guidance for CSP solving on biological sequences.
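To make the idea of frequency-based alphabet pruning concrete, here is a minimal Python sketch. It is only one plausible reading of "Rank 1/Rank 2 frequencies", not the paper's actual scheme: for every column it keeps the most frequent and second most frequent characters and discards the rest. The function name `prune_alphabet_by_rank` and the `keep` parameter are hypothetical.

```python
from collections import Counter


def prune_alphabet_by_rank(strings, keep=2):
    """Return, for each column, the `keep` most frequent characters.

    Assumption (not taken from the paper): "Rank 1" and "Rank 2" are
    read here as the most and second most frequent character of a column.
    """
    if not strings:
        return []
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all input strings must have the same length")

    candidates = []
    for column in zip(*strings):
        # most_common orders by count; ties keep first-seen order.
        ranked = Counter(column).most_common(keep)
        candidates.append([char for char, _ in ranked])
    return candidates


if __name__ == "__main__":
    for position, chars in enumerate(prune_alphabet_by_rank(["ACGT", "ACGA", "TCGT"])):
        print(position, chars)
```

Under this reading, a search over DNA strings would consider at most two characters per position instead of four, which shrinks the search space from 4^m to at most 2^m candidates for strings of length m.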

Abstract

The Closest String Problem is an NP-hard problem that asks for a string minimizing the maximum Hamming distance to all strings in a given set. Its applications include coding theory, computational biology, and the design of degenerate primers, among others. Efficient exact algorithms already reach high-quality solutions for binary sequences. However, there is still room for improvement in solution quality on DNA and protein sequences. In this paper, we introduce a three-stage algorithm. First, we apply a novel alphabet pruning method that reduces the search space so that promising search regions can be found effectively. Second, we employ a variant of beam search to find a heuristic solution; it uses a newly developed guiding function based on an expected distance heuristic score of partial solutions. Last, we introduce a local search that improves the quality of the solution obtained from the beam search. Furthermore, because real-world benchmarks are lacking, we introduce two real-world datasets to verify the robustness of the method. Extensive experimental results show that the proposed method outperforms previous approaches from the literature.
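The objective itself is easy to state in code. The following Python sketch shows the standard CSP cost (the largest Hamming distance from a candidate to any input string) and an exhaustive search that is only feasible for tiny instances. It illustrates the problem definition, not the proposed algorithm; the function names are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def csp_cost(candidate, strings):
    """CSP objective: the largest Hamming distance to any input string."""
    return max(hamming(candidate, s) for s in strings)


def brute_force_closest_string(strings, alphabet):
    """Try every string over `alphabet`; exponential, for tiny inputs only."""
    length = len(strings[0])
    best, best_cost = None, length + 1
    for chars in product(alphabet, repeat=length):
        candidate = "".join(chars)
        cost = csp_cost(candidate, strings)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


if __name__ == "__main__":
    print(brute_force_closest_string(["ACGT", "ACGA", "TCGT"], "ACGT"))
```

The exhaustive loop visits |alphabet|^m candidates for strings of length m, which is why heuristic approaches such as beam search are used on realistic DNA and protein instances.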

Paper Structure (16 sections, 7 equations, 2 figures, 5 tables, 2 algorithms)