Optimization of Consistency-Based Multiple Sequence Alignment using Big Data technologies
With the advent of new high-throughput next-generation sequencing technologies, the volume of genetic data processed has increased significantly. It is becoming essential for these applications to achieve large-scale alignments with thousands of sequences or even whole genomes. However, all current MSA tools have exhibited scalability issues when the number of sequences increases. The main drawback of these methods is that errors made in early pairwise alignments are propagated to the final result, affecting the accuracy of the global alignment. The use of consistency information enables the final result to be improved and makes it more stable from the accuracy point of view. However, such methods are severely limited by the memory required to store the consistency information. In a previous work, the authors analyzed the structure and distribution of the data stored in the constraint library and demonstrated that the library could be reduced without losing accuracy, which makes it possible to increase the number of sequences to be aligned. However, the execution time for obtaining the constraint library also increases greatly as the number of sequences grows. In the present paper, the authors apply Big Data technologies to take advantage of the high degree of parallelism provided by the MapReduce paradigm in order to considerably reduce the library calculation time. Moreover, Big Data infrastructure provides a distributed storage system to improve the library scalability and machine-learning algorithms to enhance the consistency selection policies.
Is part of: The Journal of Supercomputing, 2019, vol. 75, p. 1310–1322
Showing items related by title, author, creator and subject.
PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark. Lladós Segura, Jordi; Guirado Fernández, Fernando; Cores Prado, Fernando (Springer, 2017). Large-scale data processing techniques, currently known as Big-Data, are used to manage the huge amount of data that are generated by sequencers. Although these techniques have significant advantages, few biological ...
High Performance computing improvements on bioinformatics consistency-based multiple sequence alignment tools. Orobitg Cortada, Miquel; Guirado Fernández, Fernando; Cores Prado, Fernando; Lladós Segura, Jordi; Notredame, Cedric (Elsevier, 2014-10-08). Multiple Sequence Alignment (MSA) is essential for a wide range of applications in Bioinformatics. Traditionally, the alignment accuracy was the main metric used to evaluate the goodness of MSA tools. However, with the ...
Lladós Segura, Jordi; Cores Prado, Fernando; Guirado Fernández, Fernando (Mary Ann Liebert, 2018). Next-generation sequencing, also known as high-throughput sequencing, has increased the volume of genetic data processed by sequencers. In the bioinformatic scientific area, highly rated multiple sequence alignment tools, ...