The search engine implemented the K-score algorithm [45], generating output comparable to that of the original implementation for the same input files in mass spectrometry-based proteomics.
A parallel protein structure alignment algorithm has also been proposed based on the Hadoop distributed platform [46]. The authors analysed and compared the structure alignments produced by different methods using a dataset randomly selected from the Protein Data Bank (PDB) [19]. The experimental results verified that the proposed algorithm refined the resulting alignments more accurately than existing algorithms, while its computational performance scaled in proportion to the number of processors used in the cloud platform.
The implementation of genome-wide association study (GWAS) statistical tests in the R programming language has been presented in the form of the BlueSNP R package [47], which executes calculations across clusters configured with Hadoop. An efficient algorithm for DNA fragment assembly in the MapReduce framework has also been proposed [48]. The experimental results show that the parallel strategy can effectively improve computational efficiency and remove the memory limitations of the assembly algorithm based on the Euler superpath [49].
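The MapReduce decomposition behind such assembly pipelines can be illustrated with a k-mer counting pass, a basic building block of Euler-path assembly. The sketch below is a minimal single-process illustration of the map/reduce pattern, not the algorithm of [48]; the function names are invented for illustration:

```python
from collections import defaultdict
from itertools import chain

def map_kmers(read, k=4):
    """Map phase: emit (k-mer, 1) pairs from a single sequencing read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reduce_counts(pairs):
    """Reduce phase: sum the counts for each k-mer key."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

# In Hadoop the shuffle groups pairs by key across nodes; here we
# simulate it by chaining all mapper outputs into one reducer.
reads = ["GATTACAGATT", "ACAGATTACA"]
pairs = chain.from_iterable(map_kmers(r) for r in reads)
kmer_counts = reduce_counts(pairs)
```

Because each mapper sees only its own reads and each reducer only its own keys, neither phase needs the whole dataset in memory, which is the property the authors of [48] exploit.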
Next-generation genome mapping software has been developed for SNP discovery and genotyping [50]. The software, known as CloudBurst, is implemented on top of the Hadoop platform for the analysis of next-generation sequencing data. Performance comparison studies have been conducted between the message passing interface (MPI) [51], Dryad [52], and the Hadoop MapReduce programming framework using three bioinformatics applications [53]. BLAST and gene set enrichment analysis (GSEA) algorithms have been implemented in Hadoop [54] for streaming computation on large data sets and multi-pass computation on relatively small datasets.
The results indicate that the framework could support a wide range of bioinformatics applications while maintaining good computational efficiency, scalability, and ease of maintenance. The Hadoop platform has been used for multiple sequence alignment [58] using HBase. The reciprocal smallest distance (RSD) algorithm for gene sequence comparison has been redesigned to run on the Amazon EC2 cloud [42].
The redesigned algorithm was used for ortholog calculations across a wide selection of fully sequenced genomes.
According to their results, MapReduce provides a substantial boost to the process. Cloudgene [59] is a freely available platform that improves the usability of MapReduce programs in bioinformatics. Cloudgene provides a standardized graphical execution environment for currently available and future MapReduce programs, which can be integrated via its plug-in interface. The results show that MapReduce programs can be integrated into Cloudgene with little effort and without adding any computational overhead to existing programs.
Currently, five different MapReduce-based bioinformatics programs and two systems have been integrated and successfully deployed [59].
Hydra is a genome sequence database search engine designed to run on top of the Hadoop MapReduce distributed computing framework [60]. It implements the K-score algorithm [45] and generates output comparable to that of the original implementation for the same input files.
The results show that the software scales in its ability to handle a large peptide database. A parallel version of the random forest algorithm [61] for regression and genetic similarity learning tasks has been developed [62] for large-scale population genetic association studies involving multivariate traits. It is implemented using the MapReduce programming framework on top of Hadoop.
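The map/reduce split for an ensemble learner of this kind is conceptually simple: each map task trains a sub-forest on its own data partition, and the reduce step concatenates the sub-forests into one ensemble. The sketch below shows only this pattern, not the method of [62]; the per-worker "tree" is a stand-in mean predictor rather than a real decision tree, and all names are invented for illustration:

```python
import random

def map_train(partition, n_trees=5, seed=0):
    """Map phase: train a sub-forest on bootstrap samples of one partition."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(partition) for _ in partition]  # bootstrap resample
        # Stand-in "tree": predicts the mean target of its bootstrap sample.
        trees.append(sum(y for _, y in sample) / len(sample))
    return trees

def reduce_merge(tree_lists):
    """Reduce phase: concatenate per-worker sub-forests into one ensemble."""
    return [tree for trees in tree_lists for tree in trees]

def predict(forest, x):
    """Ensemble prediction: average the individual predictors
    (x is unused by the stand-in mean predictors)."""
    return sum(forest) / len(forest)

# Two workers, each holding (feature, target) pairs for part of the data.
partitions = [[(0.0, 1.0), (1.0, 2.0)], [(2.0, 3.0), (3.0, 4.0)]]
forest = reduce_merge(map_train(p, seed=i) for i, p in enumerate(partitions))
```

Because tree training is embarrassingly parallel, the shuffle and reduce stages carry only the fitted models rather than the genotype data, which is what makes the approach attractive for large cohorts.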
The algorithm has been applied to a genome-wide association study of Alzheimer's disease (AD), in which the quantitative trait is a high-dimensional neuroimaging phenotype describing longitudinal changes in brain structure, and notable speed-ups in processing were obtained. A solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations has been proposed [63].
The described procedure is an effort to decompose and parallelize sequence alignment in anticipation of volumes of genomic sequence data that cannot be processed using sequential programming methods. Nephele is a suite of tools [64] that uses the complete composition vector algorithm [65] to represent each genome sequence in the dataset as a vector derived from its constituents. The method is implemented using the MapReduce framework on top of the Hadoop platform and produces results that correlate well with expert-defined clades at a fraction of the computational cost of traditional methods [64].
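The composition vector idea can be approximated with a plain k-mer frequency vector; the full method [65] additionally subtracts a Markov-model background term, which is omitted in this hypothetical sketch:

```python
from collections import Counter
from math import sqrt

def composition_vector(seq, k=3):
    """Represent a genome sequence as normalised k-mer frequencies.
    Simplified: the complete composition vector method also subtracts
    a Markov-model background expectation for each k-mer."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def cosine_similarity(u, v):
    """Similarity between two genome vectors, usable for clustering."""
    keys = set(u) | set(v)
    dot = sum(u.get(key, 0.0) * v.get(key, 0.0) for key in keys)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

v1 = composition_vector("GATTACAGATTACA")
v2 = composition_vector("TTACAGATTACAGA")
```

In a MapReduce setting each map task would compute the vector for one genome, and the pairwise similarities used to build the tree would be computed in a subsequent round, which mirrors the decomposition Nephele exploits.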
A practical framework [66] based on the MapReduce programming model has been developed to infer large gene networks by developing and parallelizing a hybrid genetic algorithm-particle swarm optimization (GA-PSO) method [67]. The authors used the open-source software GeneNetWeaver to create the gene profiles. The results show that the parallel method based on the MapReduce framework can successfully infer networks with the desired behaviours while reducing computation time.
A method to enhance the accuracy and efficiency of RNA secondary structure prediction by sequence segmentation and MapReduce has been implemented [68]. The results show that, using statistical analysis implemented in the MapReduce framework, inversion-based chunking methods can outperform predictions based on the whole sequence.
Rainbow [69] is a cloud-based software package that can assist in the automation of large-scale whole-genome sequencing (WGS) data analyses. It overcomes limitations of Crossbow [70], a software tool that can detect SNPs in WGS data from a single subject. The performance of Rainbow was evaluated by analysing 44 whole-genome-sequenced subjects.
Rainbow has the capacity to process genomic data from a large number of subjects in two weeks using cloud computing provided by Amazon Web Services. Mercury [71] is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. The parallel ensemble empirical mode decomposition (EEMD) algorithm [72] has been implemented on top of the Hadoop platform in a modern cyberinfrastructure [73].
Test results and performance evaluation show that parallel EEMD can significantly improve the performance of neural signal processing. A novel approach has been proposed [39] to store and process clinical signals based on the Apache HBase distributed column store and the MapReduce programming framework, with an integrated Web-based data visualization layer. The growth in the volume of medical images produced daily in modern hospitals has forced a move away from traditional medical image analysis and indexing approaches towards scalable solutions [74].
A cluster of heterogeneous computing nodes was set up using the Hadoop platform, allowing a maximum of 42 concurrent map tasks. The majority of the machines used were desktop computers also used for regular office work. The three use cases reflect the various challenges of processing medical images in different clinical scenarios.
An ultrafast and scalable cone-beam computed tomography (CT) reconstruction algorithm using MapReduce in a cloud-computing environment has been proposed [76]. The map functions were used to filter and back-project subsets of projections, and the reduce functions to aggregate the partial back-projections into the whole volume.
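The map/reduce decomposition described can be sketched as follows. The filtered back-projection itself is replaced by a placeholder that smears each projection value uniformly over the volume, so only the aggregation pattern is faithful to the approach, and the function names are invented for illustration:

```python
import numpy as np

def map_backproject(projection_subset, shape=(4, 4)):
    """Map phase: back-project one subset of projections into a
    partial volume. Placeholder: each projection value is smeared
    uniformly instead of filtered and back-projected along rays."""
    partial = np.zeros(shape)
    for value in projection_subset:
        partial += value / partial.size
    return partial

def reduce_aggregate(partial_volumes):
    """Reduce phase: sum the partial back-projections into the whole
    volume (back-projection is linear, so summation is exact)."""
    return sum(partial_volumes)

# Two map tasks, each handling a subset of projection values.
subsets = [[1.0, 2.0], [3.0]]
volume = reduce_aggregate(map_backproject(s) for s in subsets)
```

The linearity of back-projection is what makes the reduce step a simple sum, and it also explains why the reported speed-up scales roughly linearly with the number of nodes.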
The speed-up of reconstruction time was found to be roughly linear in the number of nodes employed. The studies referenced in this paper are tabulated, grouped by relevant categories, with the following fields: study name, year, technology used, and potential application of the algorithm or technology.
Health care systems in general suffer from unsustainable costs and poor data utilization [78]. Therefore, there is a pressing need for solutions that can reduce unnecessary costs. Advances in health quality outcomes and cost-control measures depend on using the power of large integrated databases to uncover patterns and insights. However, there is much less certainty about how this clinical data should be collected, maintained, disclosed, and used. The problem in health care systems is not the lack of data; it is the lack of information that can be utilized to support critical decision-making [79].
This presents the following challenges to big data solutions in clinical facilities. Health care is resistant to redesigning processes and adopting technology that influences the health care system [80].
Clinical data is generated from many sources. There are many benefits to sharing clinical big data between researchers and scholars; however, these benefits are restricted by the privacy issues and laws that regulate clinical data privacy and access [81]. Big data solution architectures have to be flexible and adaptable to manage the variety of dispersed sources and the growth of standards and regulations.
Big data has substantial potential to unlock the whole health care value chain [83].
Big data analytics has changed the traditional perspective of health care systems from finding new drugs to patient-centred health care with better clinical outcomes and increased efficiency. Future applications of big data in the health care system have the potential to enhance and accelerate interactions among clinicians, administrators, lab directors, logistics managers, and researchers by saving costs, creating better efficiencies based on outcome comparison, reducing risks, and improving personalized care.
Large amounts of health data are unstructured, existing as documents, images, and clinical or transcribed notes [84]. Research articles, review articles, clinical references, and practice guidelines are rich sources for text analytics applications that aim to discover knowledge by mining these types of text data. Genomic data represent significant amounts of gene sequencing data, and applications are required to analyse and understand the sequences with a view to improving patient treatment.
Streamed data (home monitoring, tele-health, and handheld and sensor-based wireless devices) are well-established sources of clinical data. Social media will increase communication between patients, physicians, and communities. Consequently, analytics are required to analyse this data to detect emerging disease outbreaks, patient satisfaction, and patient compliance with clinical regulations and treatments. Administrative data such as billing, scheduling, and other non-health data present an exponentially growing source of data.
Analysing and optimizing this kind of data can save large amounts of money and increase the sustainability of a health care facility [78, 79, 83]. The aforementioned types of clinical data sources provide a rich environment for research and give rise to many future applications aimed at better patient-treatment outcomes and a more sustainable health care system.