Scalable Consistency in T-Coffee Through Apache Spark and Cassandra Database
Date
2018
Abstract
Next-generation sequencing, also known as high-throughput sequencing, has increased the volume of genetic data processed by sequencers. In bioinformatics, highly rated multiple sequence alignment tools, such as MAFFT, ProbCons, and T-Coffee (TC), use probabilistic consistency as a step prior to the progressive alignment stage to improve final accuracy. However, such methods are severely limited by the memory required to store the consistency information. Big data processing and persistence techniques are used to manage and store the huge amount of information that is generated. Although these techniques have significant advantages, few biological applications have adopted them. In this article, a novel approach named big data tree-based consistency objective function for alignment evaluation (BDT-Coffee) is presented. BDT-Coffee integrates into TC, through a Cassandra database, consistency information previously generated with the MapReduce processing paradigm. This enables large data sets to be processed, with the aim of improving the performance and scalability of the original algorithm.
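The abstract does not give implementation details, so the following is only a minimal sketch of how a consistency library could be built in MapReduce style. It is written in plain Python and does not use Spark or Cassandra APIs. The function names, the toy alignments, and the weights are all hypothetical. The map step emits one weighted record per aligned residue pair, and the reduce step sums the weights per key, giving a key-value table of the kind that could be persisted in a wide-column store instead of being held in memory.

```python
from collections import defaultdict


def aligned_pairs(name_a, row_a, name_b, row_b, weight):
    """Map step: emit ((seq_a, pos_a, seq_b, pos_b), weight) for each
    column of a pairwise alignment where both rows hold a residue."""
    pos_a = pos_b = 0  # residue indices in the ungapped sequences
    for ch_a, ch_b in zip(row_a, row_b):
        if ch_a != "-" and ch_b != "-":
            yield (name_a, pos_a, name_b, pos_b), weight
        if ch_a != "-":
            pos_a += 1
        if ch_b != "-":
            pos_b += 1


def reduce_by_key(records):
    """Reduce step: sum the weights of all records that share a key."""
    totals = defaultdict(float)
    for key, weight in records:
        totals[key] += weight
    return dict(totals)


# Hypothetical toy input: two alternative pairwise alignments of the
# same two sequences (s1 = ACGT, s2 = ACAGT), each with an assumed weight.
alignments = [
    ("s1", "AC-GT", "s2", "ACAGT", 80.0),
    ("s1", "A-CGT", "s2", "ACAGT", 60.0),
]

records = [rec for aln in alignments for rec in aligned_pairs(*aln)]
library = reduce_by_key(records)

for key in sorted(library):
    print(key, library[key])
```

Residue pairs supported by both alignments accumulate a higher weight than pairs seen in only one. Because each key is independent, the reduce step can be partitioned across workers, and each (key, weight) row can be written out to and looked up from an external store.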
Journal or Series
Journal of Computational Biology, 2018, vol. 25, no. 8, p. 894-906