Genomics: SegAlign
The advent of low-cost, high-throughput genome sequencing technologies has triggered an unprecedented growth in genomic data that has far surpassed Moore's law. Traditional multi-core software systems are no longer sufficient to extract useful biological insights from these genome sequences. Our research aims to accelerate various genomic analysis pipelines through hardware-software-algorithm co-design.
SegAlign is a scalable, GPU-based system for computing pairwise Whole Genome Alignment (WGA). Multiple large-scale projects are currently underway to sequence and assemble the genomes of millions of species over the next several years. Pairwise WGA is the crucial first step to unlocking biological discoveries from these genomes. However, computing WGA for even a fraction of the millions of possible pairs is prohibitive: WGA of a single pair of vertebrate genomes (human-mouse) takes 11 hours on a 96-core Amazon Web Services (AWS) instance (c5.24xlarge).
SegAlign is based on the standard seed-filter-extend heuristic used by LASTZ, in which the filtering stage dominates the runtime (e.g., 98% for human-mouse WGA). SegAlign accelerates this stage using one or more GPUs. Using three vertebrate genome pairs, we show that SegAlign provides a speedup of up to 14x for WGA on an 8-GPU, 64-core AWS instance (p3.16xlarge) and a nearly 2.3x reduction in dollar cost. SegAlign can also be parallelized across multiple GPU nodes and scales efficiently.
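To make the filtering stage concrete, here is a minimal CPU-side sketch in Python of ungapped x-drop extension, the kind of filter used in seed-filter-extend aligners. It is illustrative only and is not SegAlign's GPU kernel: the function names, the simple match/mismatch scoring (real aligners use a substitution matrix), and the default `xdrop` and `threshold` values are all assumptions made for this example.

```python
def _extend_one_way(target, query, t, q, step, match, mismatch, xdrop):
    """Walk along one diagonal direction; return (best score, length at best)."""
    best = run = 0
    best_len = n = 0
    while 0 <= t < len(target) and 0 <= q < len(query):
        run += match if target[t] == query[q] else mismatch
        n += 1
        if run > best:
            best, best_len = run, n
        elif best - run > xdrop:
            # Score dropped more than xdrop below the best seen: stop.
            break
        t += step
        q += step
    return best, best_len


def ungapped_extend(target, query, t_pos, q_pos, seed_len,
                    match=1, mismatch=-1, xdrop=3):
    """Extend a seed hit left and right without gaps.

    Returns (score, target_start, target_end) with target_end exclusive.
    Scoring parameters are hypothetical defaults for illustration.
    """
    seed_score = sum(
        match if target[t_pos + i] == query[q_pos + i] else mismatch
        for i in range(seed_len)
    )
    right, r_len = _extend_one_way(target, query, t_pos + seed_len,
                                   q_pos + seed_len, 1, match, mismatch, xdrop)
    left, l_len = _extend_one_way(target, query, t_pos - 1,
                                  q_pos - 1, -1, match, mismatch, xdrop)
    return seed_score + left + right, t_pos - l_len, t_pos + seed_len + r_len


def filter_seed_hits(target, query, hits, seed_len, threshold, **scoring):
    """Keep only seed hits whose ungapped extension reaches the threshold."""
    return [
        (t, q) for t, q in hits
        if ungapped_extend(target, query, t, q, seed_len, **scoring)[0] >= threshold
    ]
```

Each seed hit is extended independently of the others, which is the property that makes this stage a natural fit for massively parallel hardware; only the hits that survive the filter go on to the more expensive gapped extension.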
SegAlign's ungapped extension kernel has been integrated into NVIDIA’s GenomeWorks library as a standalone API. SegAlign has also been integrated into the Cactus multiple genome alignment tool. In the near future, a SegAlign-accelerated version of Cactus will be used to generate thousand-way vertebrate genome alignments.
Code: https://github.com/gsneha26/SegAlign
Related publication
S. Goenka, Y. Turakhia, B. Paten, and M. Horowitz, "SegAlign: A Scalable GPU-Based Whole Genome Aligner," in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Atlanta, GA, US, 2020, pp. 540-552.