AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

Jeremie S. Kim, Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Mohammed Alser, Nastaran Hajinazar, Can Alkan, Onur Mutlu

TL;DR

This work reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4× and finds that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.

Abstract

AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set that had previously been mapped to one reference genome to another, similar reference. Users can then quickly run downstream analysis of read sets for each new reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4×. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground-truth SNP/INDEL variants. The AirLift source code and a readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.

Paper Structure

This paper contains 25 sections, 1 equation, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Limitations of Existing Remapping Tools. Existing remapping tools correctly remap reads that mapped completely within a region indicated by the chain file (e.g., Read 2). However, these tools 1) cannot remap reads that mapped within a region in the old reference that does not appear in the new reference (e.g., Read 1) and 2) may incorrectly remap reads that align to multiple constant regions in the old reference (e.g., Read 3).
  • Figure 2: An example pair of reference genomes (old and new) with regions labeled (as constant, updated, retired, and new regions) and associated with each other according to their degrees of similarity. Regions that are associated with (i.e., similar to) each other are indicated with an arrow. Example differences across associated updated regions are shown with black vertical bars.
  • Figure 3: AirLift uses eight key steps to identify and label regions in the old and new reference genomes as constant, updated, retired, or new in order to efficiently map any number of reads from an old reference genome to a new reference genome.
  • Figure 4: Using AirLift to remap a read set. AirLift remaps each read differently depending on the label of the region in the old reference that the read had originally mapped to: constant, updated, retired, or unmapped.
  • Figure 5: AirLift execution time results. We show the execution time (log-scale y-axis) of three remapping tools, CrossMap (blue), AirLift (orange), and LiftOver (green), when remapping a read set to a new reference genome, compared against the baseline (red) of fully mapping the read set to the new reference genome. We plot the execution times of each tool for various pairs of reference genomes (x-axis; the old reference is at the bottom and the new reference is above it) in three separate plots for different sizes of reference genomes, i.e., large (human), medium (C. elegans), and small (yeast). We indicate the speedup of AirLift over the full mapping baseline above each grouping of bars, since AirLift and the baseline are the only comprehensive and accurate remapping techniques available.
  • ...and 3 more figures
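
The caption of Figure 4 says that a read is handled according to the label of the old-reference region it originally mapped to (constant, updated, retired, or unmapped). A minimal sketch of that first grouping step might look like the following. It is not AirLift's implementation: the interval table `REGIONS`, the functions `label_for` and `group_reads_by_label`, and the use of a single start coordinate per read are all hypothetical simplifications, and the action taken for each group is deliberately left out because this excerpt does not describe it.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical labeled regions on one chromosome of the old reference.
# Each entry is (start, end, label) with half-open coordinates; the list is
# sorted by start and the regions do not overlap.
REGIONS = [
    (0, 1000, "constant"),
    (1000, 1500, "updated"),
    (1500, 1800, "retired"),
    (1800, 3000, "constant"),
]
STARTS = [start for start, _, _ in REGIONS]


def label_for(position):
    """Return the label of the old-reference region containing `position`.

    `position` is the read's original mapping start, or None if the read
    did not map to the old reference at all.
    """
    if position is None:
        return "unmapped"
    # Find the last region whose start is <= position.
    i = bisect_right(STARTS, position) - 1
    if i >= 0:
        start, end, label = REGIONS[i]
        if start <= position < end:
            return label
    raise ValueError(f"position {position} is outside every labeled region")


def group_reads_by_label(reads):
    """Group read names by region label; `reads` is a list of (name, position)."""
    groups = defaultdict(list)
    for name, position in reads:
        groups[label_for(position)].append(name)
    return dict(groups)


if __name__ == "__main__":
    example = [("r1", 1600), ("r2", 200), ("r3", 1200), ("r4", None)]
    for label, names in group_reads_by_label(example).items():
        print(label, names)
```

A real read covers an interval rather than a single coordinate, so it can straddle two regions with different labels (the situation Figure 1 illustrates for Read 1 and Read 3). A fuller version would need to look up both ends of the read and decide how to treat reads that cross a boundary.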