Variações do Problema de Distância de Rearranjos (Variations of the Rearrangement Distance Problem)

Alexsandro Oliveira Alexandrino

TL;DR

This work studies genome rearrangement distance problems on balanced and unbalanced genomes, taking into account both gene order and intergenic region sizes. It uses several genome representations (signed and unsigned permutations, strings, lists of intergenic region sizes) and techniques based on breakpoints and cycle graphs to analyze and approximate sorting by reversals, transpositions, and their combinations with indels and block interchanges. Theoretical results include NP-hardness proofs for many model variants and a new $1.375$-approximation algorithm for Sorting by Transpositions that runs in $O(n^5)$ time, together with improved $2$-, $3$-, and $4$-approximation algorithms for unbalanced and intergenic models. Experiments on synthetic and real genomes (e.g., Cyanorak 2.1) indicate favorable approximation performance in practice and show that the resulting distances are useful for phylogeny reconstruction in the presence of intergenic regions and gene-content variation.

Abstract

Given a pair of genomes, the goal of rearrangement distance problems is to estimate how distant these genomes are from each other based on genome rearrangements. Seminal works in genome rearrangements assumed that both genomes being compared have the same set of genes (balanced genomes) and that only the relative order of the genes and their orientations, when known, are used in the mathematical representation of the genomes. In this case, the genomes are represented as permutations, giving rise to the Sorting Permutations by Rearrangements problems. The main Sorting Permutations by Rearrangements problems consider DCJs, reversals, transpositions, or the combination of reversals and transpositions, and the complexity of these problems is known. Other problems have also been studied that combine transpositions with one or more of the following rearrangements: transreversals, revrevs, and reversals. Although approximation results exist for these problems, their complexity remained open. This thesis presents complexity proofs for these problems. Furthermore, we present a new 1.375-approximation algorithm with better time complexity for the Sorting Permutations by Transpositions problem. When considering unbalanced genomes, insertions and deletions are necessary to transform one genome into another. In this thesis, we study Rearrangement Distance problems on unbalanced genomes considering only gene order and orientations (when known), as well as Intergenic Rearrangement Distance problems, which also incorporate the size distribution of intergenic regions. We present complexity proofs and approximation algorithms for problems that include reversals and transpositions.
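As a small illustration of the permutation model described in the abstract, the sketch below applies a signed reversal and a transposition to a permutation and counts breakpoints. The function names and the 1-based index conventions ($\rho(i,j)$ with $1 \le i \le j \le n$, and $\tau(i,j,k)$ with $1 \le i < j < k \le n+1$) are assumptions taken from common usage in the genome rearrangement literature, not from the thesis itself. The breakpoint definition used here (adjacent elements of the extended permutation that do not differ by exactly $+1$) is one standard variant and may differ from the definitions the thesis adopts for each model.

```python
from typing import List


def reversal(perm: List[int], i: int, j: int) -> List[int]:
    """Signed reversal rho(i, j): reverse perm[i..j] (1-based) and flip every sign."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("expected 1 <= i <= j <= n")
    segment = [-x for x in reversed(perm[i - 1:j])]
    return perm[:i - 1] + segment + perm[j:]


def transposition(perm: List[int], i: int, j: int, k: int) -> List[int]:
    """Transposition tau(i, j, k): swap adjacent blocks perm[i..j-1] and perm[j..k-1]."""
    if not 1 <= i < j < k <= len(perm) + 1:
        raise ValueError("expected 1 <= i < j < k <= n + 1")
    return perm[:i - 1] + perm[j - 1:k - 1] + perm[i - 1:j - 1] + perm[k - 1:]


def breakpoints(perm: List[int]) -> int:
    """Count pairs (a, b) of adjacent elements in the extended permutation
    (0, perm, n + 1) with b - a != 1 (assumed breakpoint definition)."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)


if __name__ == "__main__":
    print(transposition([3, 1, 2], 1, 2, 4))  # moves the block (3) after (1 2)
    print(reversal([1, -3, -2], 2, 3))        # reverses and flips (-3 -2)
    print(breakpoints([3, 1, 2]), breakpoints([1, 2, 3]))
```

In this model the sorted genome is the identity permutation, which has no breakpoints, so each rearrangement can be judged by how many breakpoints it removes.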

Paper Structure (45 sections, 50 equations, 26 figures, 26 tables, 15 algorithms)

Introduction
Theoretical Background
Genome Representation
Representation of the Relative Gene Order
Representation of Intergenic Regions
Genome Rearrangements
Effect of Genome Rearrangements on Intergenic Regions
Rearrangement Distance Problems
Breakpoints
Breakpoints in Permutations
Breakpoints in Unbalanced Genomes
Intergenic Breakpoints
Cycle Graph
Cycle Graph for Permutations
Labeled Cycle Graph
...and 30 more sections

Figures (26)

Figure 1: Example of two genomes $\mathcal{G}_o$ and $\mathcal{G}_d$, where genes are represented by letters inside arrows, gene orientation is indicated by the direction of the arrows, and intergenic region sizes are represented by numbers inside rectangles. The genes of $\mathcal{G}_d$ are mapped as follows: $a$ is mapped to $+1$, $c$ is mapped to $+2$, $d$ is mapped to $+3$, $h$ is mapped to $+4$, and $f$ is mapped to $+5$. Thus, the genome $\mathcal{G}_d$ is represented by $(\iota^n, \breve{\iota}^n)$, where $\iota^n = ({+1}~{+2}~{+3}~{+4}~{+5})$ and $\breve{\iota}^n = (5,2,7,1,0,5)$. The gene $x$ and the segment from $y$ to $z$ in $\mathcal{G}_o$ are not present in $\mathcal{G}_d$. Therefore, both are mapped to the element $\alpha$. The genome $\mathcal{G}_o$ is represented by $(A, \breve{A})$, where $A = ({+4}~{+3}~\alpha~{-1}~{+2}~\alpha)$ and $\breve{A} = (0,3,2,3,10,2,6)$. The alphabets $\Sigma_A$ and $\Sigma_{\iota^n}$ are the sets $\{1,2,3,4,\alpha\}$ and $\{1,2,3,4,5\}$, respectively.
Figure 2: Example of two genomes $\mathcal{G}_o$ and $\mathcal{G}_d$, where genes are represented by letters inside circles, gene orientation is unknown, and intergenic region sizes are represented by numbers inside rectangles. The genes of $\mathcal{G}_d$ are mapped as follows: $a$ is mapped to $1$, $c$ is mapped to $2$, $d$ is mapped to $3$, $h$ is mapped to $4$, and $f$ is mapped to $5$. Thus, the genome $\mathcal{G}_d$ is represented by $(\iota^n, \breve{\iota}^n)$, where $\iota^n = ({1}~{2}~{3}~{4}~{5})$ and $\breve{\iota}^n = (5,2,7,1,0,5)$. The gene $x$ and the segment from $y$ to $z$ in $\mathcal{G}_o$ are not present in $\mathcal{G}_d$. Therefore, both are mapped to the element $\alpha$. The genome $\mathcal{G}_o$ is represented by $(A, \breve{A})$, where $A = ({4}~{3}~\alpha~{1}~{2}~\alpha)$ and $\breve{A} = (0,3,2,3,10,2,6)$. The alphabets $\Sigma_A$ and $\Sigma_{\iota^n}$ are the sets $\{1,2,3,4,\alpha\}$ and $\{1,2,3,4,5\}$, respectively.
Figure 3: Cycle graph $G(\pi)$ of the unsigned permutation $\pi=(5~4~1~6~3~2)$. Horizontal lines and arcs represent origin edges and destination edges, respectively. The index of an origin edge is indicated by a number below that edge. In this example, there are three cycles in $G(\pi)$: $C_1 = (3, 1)$, $C_2 = (6, 2, 4)$, and $C_3 = (7,5)$. The cycle $C_2$ is odd and the cycles $C_1$ and $C_3$ are even.
Figure 4: Cycle graph $G(\pi)$ of the signed permutation $\pi = ({+5}~{+4}~{+1}~{-6}~{-3}~{-2})$. In this example, there are 4 cycles in $G(\pi)$: $C_1 = (3,1)$, $C_2 = (5, 2)$, $C_3 = (6)$, and $C_4 = (7, 4)$.
Figure 5: Labeled cycle graph $G(\mathcal{I}) = G(A, \iota^n)$ for the strings $\iota^n$, with $n = 11$, and $A = (\alpha~{+7}~{\alpha}~{-5}~{-4}~{+3}~{-2}~{+9}~{+11}~{+10})$. There are four cycles in this graph. The cycle $C_1 = (6, 1, 2)$ is a labeled divergent cycle. All other cycles are clean cycles. The cycle $C_2 = (3)$ is a unitary cycle, the cycle $C_3 = (5, 4)$ is a divergent cycle, and the cycle $C_4 = (9, 7, 8)$ is an oriented cycle.
...and 21 more figures
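
The cycle graphs in the captions of Figures 3 and 4 can be reproduced with a short sketch. It assumes the construction commonly used in the literature, which the captions appear to follow: the permutation is extended with $\pi_0 = 0$ and $\pi_{n+1} = n+1$, origin edge $i$ joins the vertices $+\pi_{i-1}$ and $-\pi_i$, and destination edges join $+(i-1)$ and $-i$. An unsigned permutation is treated as all positive, and each cycle is listed by its origin-edge indices, starting from the highest index and entering that edge from its right endpoint. The function name `cycle_graph_cycles` is hypothetical. Orientation signs of individual edges and the labeled variant of Figure 5 are not modeled.

```python
from typing import Dict, List


def cycle_graph_cycles(perm: List[int]) -> List[List[int]]:
    """Return the cycles of the cycle graph as lists of origin-edge indices."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]
    # A vertex is stored as a signed integer: +x -> x, -x -> -x (0 stands for +0).
    left = {i: ext[i - 1] for i in range(1, n + 2)}   # vertex +pi_{i-1}
    right = {i: -ext[i] for i in range(1, n + 2)}     # vertex -pi_i
    edge_of: Dict[int, int] = {}
    for i in range(1, n + 2):
        edge_of[left[i]] = i
        edge_of[right[i]] = i

    seen = set()
    cycles = []
    for start in range(n + 1, 0, -1):
        if start in seen:
            continue
        cycle = []
        v = right[start]                 # enter the start edge from the right
        while True:
            e = edge_of[v]
            cycle.append(e)
            seen.add(e)
            # Cross the origin edge to its other endpoint.
            u = left[e] if v == right[e] else right[e]
            # Follow the destination edge: +x <-> -(x + 1), i.e. v = -u - 1.
            v = -u - 1
            if v == right[start]:
                break
        cycles.append(cycle)
    return sorted(cycles)


if __name__ == "__main__":
    print(cycle_graph_cycles([5, 4, 1, 6, 3, 2]))      # permutation of Figure 3
    print(cycle_graph_cycles([5, 4, 1, -6, -3, -2]))   # permutation of Figure 4
```

Under these assumptions the two calls yield the cycles listed in the captions of Figures 3 and 4, and the identity permutation yields $n+1$ cycles with one origin edge each. A cycle is odd or even according to the length of its list.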

Theorems & Definitions (78)

