A step toward a reinforcement learning de novo genome assembler

Kleber Padovani, Roberto Xavier, Rafael Cabral Borges, Andre Carvalho, Anna Reali, Annie Chateau, Ronnie Alves

TL;DR

This study expanded upon the sole previous approach found in the literature to solve the problem of genome assembly by carefully exploring the learning aspects of the proposed intelligent agent, which uses the Q-learning algorithm, and provided insights for the next steps of automated genome assembly development.

Abstract

De novo genome assembly is a relevant but computationally complex task in genomics. Although de novo assemblers have been used successfully in several genomics projects, there is still no 'best assembler', and the choice and setup of assemblers still rely on bioinformatics experts. Thus, as with other computationally complex problems, machine learning may emerge as an alternative (or complementary) way of developing more accurate and automated assemblers. Reinforcement learning has proven promising for solving complex activities without supervision - such as games - and there is a pressing need to understand the limits of this approach when applied to 'real' problems, such as the DFA problem. This study aimed to shed light on the application of machine learning, using reinforcement learning (RL), in genome assembly. We expanded upon the sole previous approach found in the literature to solve this problem by carefully exploring the learning aspects of the proposed intelligent agent, which uses the Q-learning algorithm, and we provided insights for the next steps of automated genome assembly development. We improved the reward system and optimized the exploration of the state space based on pruning and in collaboration with evolutionary computing. We tested the new approaches on 23 new, larger environments, which are all available on the internet. Our results suggest consistent performance progress; however, we also found limitations, especially concerning the high dimensionality of state and action spaces. Finally, we discuss paths for achieving efficient and automated genome assembly in real scenarios considering successful RL applications - including deep reinforcement learning.

Paper Structure

This paper contains 19 sections, 8 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Example of a state space. Illustration of a full state space for a set of two reads, here referred to by A and B.
  • Figure 2: Illustration of the application of reinforcement learning to the genome assembly problem. The set of reads is represented computationally by a reinforcement learning environment. Through successive interactions with the environment, caused by taking actions, the agent ideally learns the correct order of reads, reaching the target genome.
  • Figure 3: Illustration of the pruning procedure. State space corresponding to the assembly of 3 reads, referred to as a, b, and c. The generic pruning procedure is defined in detail by Algorithm 1.
  • Figure 4: Illustration of the proposed interaction between reinforcement learning (RL) and the genetic algorithm. At each RL episode, the actions taken by the agent are converted into the chromosome (each action as a gene) of an individual of the initial population of the genetic algorithm, whose size $n$ is predefined. After $n$ episodes ($n$ individuals in the initial population), this population evolves for a predefined number of generations through the genetic algorithm. Then, the fittest individual of the last generation is obtained. In the end, that individual's chromosomal genes are used as actions in the next RL episode.
  • Figure 5: Flowchart representing Approaches 1.1, 1.2, 1.3, 1.4, 2 and 3.1. Approaches 1.1, 1.2, 1.3, and 1.4 are defined by the elements in gray, Approach 2 by the dashed element with double edges and Approach 3.1 by the dashed elements with single border.
  • ...and 1 more figure
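
The hand-off between RL and the genetic algorithm described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the suffix-prefix overlap fitness, the order crossover, the swap mutation, the elitism, and all function names (`overlap`, `fitness`, `evolve`, `run_episode`, `rl_with_ga`) are hypothetical choices made here. The RL episode is replaced by a random stand-in that visits each read exactly once, so no Q-table is learned.

```python
import random

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def fitness(chromosome, reads):
    """Assumed fitness: total overlap between consecutive reads."""
    return sum(overlap(reads[i], reads[j])
               for i, j in zip(chromosome, chromosome[1:]))

def order_crossover(p1, p2, rng):
    """Keep a slice of p1 in place and fill the rest in p2's order."""
    lo, hi = sorted(rng.sample(range(len(p1) + 1), 2))
    kept = p1[lo:hi]
    rest = [g for g in p2 if g not in kept]
    return rest[:lo] + kept + rest[lo:]

def swap_mutation(chromosome, rng, rate=0.2):
    """With probability `rate`, swap two genes of a copy of the chromosome."""
    c = list(chromosome)
    if rng.random() < rate:
        i, j = rng.sample(range(len(c)), 2)
        c[i], c[j] = c[j], c[i]
    return c

def evolve(population, reads, generations, rng):
    """Evolve for a fixed number of generations; return the fittest individual."""
    pop = [list(ind) for ind in population]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, reads), reverse=True)
        parents = pop[:max(2, len(pop) // 2)]
        children = [pop[0]]  # elitism (assumption): keep the current best
        while len(children) < len(pop):
            p1, p2 = rng.sample(parents, 2)
            children.append(swap_mutation(order_crossover(p1, p2, rng), rng))
        pop = children
    return max(pop, key=lambda c: fitness(c, reads))

def run_episode(reads, rng, seeded_actions=None):
    """Stand-in for one RL episode: returns the actions (read indices) taken.
    A real agent would choose actions from its Q-table; here they are random
    unless a sequence produced by the genetic algorithm is replayed."""
    if seeded_actions is not None:
        return list(seeded_actions)
    actions = list(range(len(reads)))
    rng.shuffle(actions)
    return actions

def rl_with_ga(reads, n=6, generations=10, rounds=3, seed=0):
    """Alternate n episodes with one genetic-algorithm run, `rounds` times."""
    rng = random.Random(seed)
    seeded = None
    for _ in range(rounds):
        population = []
        for _ in range(n):
            # Each episode's actions become one chromosome (action = gene).
            population.append(run_episode(reads, rng, seeded))
            seeded = None  # only the first episode replays the GA result
        seeded = evolve(population, reads, generations, rng)
    return seeded

if __name__ == "__main__":
    toy_reads = ["ACGT", "GTTA", "TACC", "CCGA"]  # made-up reads
    best = rl_with_ga(toy_reads)
    print(best, fitness(best, toy_reads))
```

Because the sketch keeps the best individual in every generation, the fitness of the returned chromosome never falls below that of the best episode in the initial population. Whether the authors' genetic algorithm uses elitism is not stated in the caption.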