Accel-align: A fast sequence mapper and aligner based on the seed–embed–extend method

Yan, Yiqing; Chaturvedi, Nimisha; Appuswamy, Raja

BMC Bioinformatics, 22 March 2021

Motivation: Improvements in sequencing technology continue to drive sequencing cost towards 100$ per genome. However, mapping sequenced data to a reference genome remains a computationallyintensive task due to the dependence on edit distance for dealing with indels and mismatches introduced by sequencing. All modern aligners use seed–filter–extend (SFE) methodology and rely on filtration heuristics to reduce the overhead of edit distance computation. However, filtering makes assumptions about error patterns, has inherent performance–accuracy trade-offs, and requires careful hand tuning to match the dataset. Results: Motivated by algorithmic advances in randomized low-distortion embeddings, we introduce seed–embed–extend (SEE), a new design methodology for developing sequence mappers and aligners. While SFE focuses on eliminating sub-optimal candidates, SEE focuses instead on identifying optimal candidates. To do so, SEE transforms the read and reference strings from edit distance regime to the Hamming regime by embedding them using a randomized algorithm, and uses Hamming distance over the embedded set to identify optimal candidates. To show that SEE perfoms well in practice, we present Accel-Align, an SEE-based short-read sequence mapper and aligner that is 3-12× faster than state-of-the-art aligners on commodity CPUs, without any special-purpose hardware, while providing comparable accuracy

Detail

Document

DOI

BIBTEX

Type:

Journal

Date:

2021-05-20

Department:

Data Science

Eurecom Ref:

6547

© Springer. Personal use of this material is permitted. The definitive version of this paper was published in BMC Bioinformatics, 22 March 2021 and is available at : https://doi.org/10.1101/2020.07.20.211888