Optimizing the accuracy of randomized embedding for sequence alignment

Yan, Yiqing; Chaturvedi, Nimisha; Appuswamy, Raja
IPDPS 2022, 36th IEEE International Parallel & Distributed Processing Symposium, 30 May-3 June 2022, Lyon, France (Virtual Event)

Gapped alignment of sequenced data to a reference genome has traditionally been a computationally intensive task due to the use of edit distance for dealing with indels and mismatches introduced by sequencing. In prior work, we developed Accel-Align [1], a Seed–Embed–Extend (SEE) sequence aligner that uses randomized embedding algorithms to quickly identify optimal candidate locations using Hamming distance rather than edit distance. While Accel-Align provides up to an order-of-magnitude performance improvement over state-of-the-art aligners, the randomized nature of embedding can lead to alignment errors, resulting in lower precision and recall with downstream variant callers. In this work, we propose several techniques for improving the accuracy of randomized embedding-based sequence alignment. We provide an efficient implementation of these techniques in Accel-Align and use it to present a comparative evaluation demonstrating that the accuracy improvements can be achieved without sacrificing performance. Code is available at github.com/raja-appuswamy/accel-align-release.
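
To make the contrast between the two distances concrete, here is a minimal Python sketch. It is not taken from Accel-Align. The embedding shown is a CGK-style random walk, one well-known way to map edit distance into Hamming distance; that it resembles the embedding used by the aligner is an assumption. All function names (`hamming`, `edit_distance`, `make_bits`, `embed`), the pad symbol `N`, and the example reads are hypothetical.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def make_bits(n, alphabet="ACGT", seed=0):
    """Shared randomness: one random bit per (output position, character)."""
    rng = random.Random(seed)
    return [{c: rng.randint(0, 1) for c in alphabet} for _ in range(3 * n)]


def embed(s, bits, pad="N"):
    """CGK-style embedding: copy s[i] to the output, then advance i only
    when the random bit for (output position, s[i]) is 1. Once the input
    is exhausted, the remaining output positions are padded."""
    out = []
    i = 0
    for step in bits:
        if i < len(s):
            out.append(s[i])
            i += step[s[i]]
        else:
            out.append(pad)
    return "".join(out)


read = "ACGTACGT"
shifted = "CGTACGTA"  # same read with the leading A moved to the end

bits = make_bits(len(read))  # both strings must share the same random bits
print("edit distance:     ", edit_distance(read, shifted))
print("raw Hamming:       ", hamming(read, shifted))
print("embedded Hamming:  ", hamming(embed(read, bits), embed(shifted, bits)))
```

A single shift puts every position of the raw strings out of register, so the raw Hamming distance is as large as it can be even though the edit distance is only 2. The embedded strings can fall back into register after the shift, so their Hamming distance tends to track the edit distance more closely, at linear cost instead of quadratic. The result depends on the random bits drawn, which is the source of the alignment errors the abstract refers to.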

Type: Conference
City: Lyon
Date: 2022-05-30
Department: Data Science
Eurecom Ref: 6909
Copyright: © 2022 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

PERMALINK: https://www.eurecom.fr/publication/6909