Scalable Consistency in T-Coffee Through Apache Spark and Cassandra Database
Next-generation sequencing, also known as high-throughput sequencing, has increased the volume of genetic data generated by sequencers. In bioinformatics, highly rated multiple sequence alignment tools, such as MAFFT, ProbCons, and T-Coffee (TC), use probabilistic consistency as a step prior to the progressive alignment stage to improve final accuracy. However, such methods are severely limited by the memory required to store the consistency information. Big data processing and persistence techniques are used to manage and store the huge amount of information that is generated. Although these techniques have significant advantages, few biological applications have adopted them. In this article, a novel approach named big data tree-based consistency objective function for alignment evaluation (BDT-Coffee) is presented. BDT-Coffee integrates into TC, through a Cassandra database, consistency information previously generated with the MapReduce processing paradigm, so that large data sets can be processed, with the aim of improving the performance and scalability of the original algorithm.
Is part of: Journal of Computational Biology, 2018, vol. 25, no. 8, p. 894-906
Showing items related by title, author, creator and subject.
High Performance computing improvements on bioinformatics consistency-based multiple sequence alignment tools. Orobitg Cortada, Miquel; Guirado Fernández, Fernando; Cores Prado, Fernando; Lladós Segura, Jordi; Notredame, Cedric (Elsevier, 2014-10-08). Multiple Sequence Alignment (MSA) is essential for a wide range of applications in Bioinformatics. Traditionally, the alignment accuracy was the main metric used to evaluate the goodness of MSA tools. However, with the ...
Lladós Segura, Jordi; Guirado Fernández, Fernando; Cores Prado, Fernando; Lérida Monsó, Josep Lluís; Notredame, Cedric (Springer Verlag, 2015-05-01). Multiple sequence alignment (MSA) is crucial for high-throughput next generation sequencing applications. Large-scale alignments with thousands of sequences are necessary for these applications. However, the quality of the ...
PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark. Lladós Segura, Jordi; Guirado Fernández, Fernando; Cores Prado, Fernando (Springer, 2017). Large-scale data processing techniques, currently known as Big-Data, are used to manage the huge amount of data that are generated by sequencers. Although these techniques have significant advantages, few biological ...