In a previous post I described our Data Analysis as a Service, along with its motivation and goals. In this post, I will describe what we have been doing in this area and our plans ahead.
We have been working with big data for more than a year now at Uninett. We run a cluster of 18 physical machines on Apache Mesos, a resource manager, with Apache Spark on top of it: a distributed, fault-tolerant processing system with built-in support for machine learning algorithms. In 2014, we started prototyping with Apache Spark to process Netflow data from our network routers and performed some basic analysis. The results really surprised us in terms of both performance and ease of use. This motivated us to look at which other areas could benefit from scalable distributed processing combined with easy-to-use machine learning on big data.
We got in touch with the Proteomics and Metabolomics Core Facility (PROMEC) here in Trondheim. They have substantial protein data sets and are interested in using Apache Spark to analyze them with machine learning in a scalable way. This resulted in two different projects between PROMEC and Uninett to analyze protein data sets.
- The main challenge in using Spark to analyze the protein data sets was that the tools used by the community are geared towards monolithic systems and are not suitable for distributed ones. We therefore started with Spark Hydra to bring support for proteomics search algorithms, e.g. X!Tandem and Comet, into Spark. This enables us to scale proteomics analysis without worrying about the size of the data sets. We are currently working on improving its accuracy, and results will be published later.
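Conceptually, the pattern here is an embarrassingly parallel search: each spectrum is matched against a candidate peptide database independently, so spectra can be partitioned across workers. Below is a toy pure-Python sketch of that pattern, not Spark Hydra's actual code; `score_spectrum`, the shared-fragment scoring, and the data are invented stand-ins for what X!Tandem or Comet really compute:

```python
# Toy sketch of distributing a peptide search across spectra.
# In a Spark job, `search` would become spectra_rdd.map(...), with the
# peptide database shipped to workers as a broadcast variable.

def score_spectrum(spectrum, peptides):
    """Toy scorer: pick the candidate sharing the most fragment masses."""
    best_pep, best_score = None, -1
    for pep, frags in peptides.items():
        score = len(set(spectrum) & set(frags))
        if score > best_score:
            best_pep, best_score = pep, score
    return best_pep, best_score

def search(spectra, peptides):
    # Each spectrum is scored independently: trivially parallel.
    return [score_spectrum(s, peptides) for s in spectra]

peptides = {"PEPTIDE": [101, 202, 303], "PROTEIN": [101, 404, 505]}
spectra = [[101, 202, 303], [101, 404, 999]]
print(search(spectra, peptides))  # -> [('PEPTIDE', 3), ('PROTEIN', 2)]
```

Because each spectrum's search is independent, adding machines scales throughput almost linearly, which is what makes this workload a good fit for Spark.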
- Another project uses Spark's machine learning library to cluster proteomics MGF data produced by a high-throughput sequencer. First we computed the Within Set Sum of Squared Errors (WSSSE) for a varying number of clusters. This helped us find the optimal number of clusters in our data sets. We then trained our model with that number of clusters and tested it on the full data sets, to see whether any clusters formed outside our normal M/Z range. This let us check whether there were any issues with our sequencer.
As you can see, a small cluster forms at approximately 680 M/Z, which may be due to an issue in our sequencer. We now plan to do this analysis in real time, so we can detect such problems and quality-check the readings from the sequencer as they arrive.
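The elbow search described above can be sketched in plain Python. The real analysis uses Spark MLlib's KMeans, but the WSSSE computation is the same idea; the M/Z values below are made up to mimic two normal groups plus an outlying group near 680:

```python
import random

def kmeans_1d(points, k, iters=50, seed=0):
    """Plain 1-D Lloyd's k-means, a stand-in for Spark MLlib's KMeans."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def wssse(points, centers):
    """Within Set Sum of Squared Errors: squared distance of each
    point to its nearest center, summed over all points."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

mz = [500.2, 500.9, 501.4, 520.1, 520.8, 521.3, 679.8, 680.4]
for k in (1, 2, 3, 4):
    print(k, round(wssse(mz, kmeans_1d(mz, k)), 1))
```

Plotting WSSSE against k, the error drops sharply until k reaches the true number of groups and then flattens; that "elbow" is the cluster count we train the final model with.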
The rise of high-throughput sequencers and the falling cost of sequencing a genome have led to rapid growth in genomic data sets. Thanks to advances in sequencing technology, the cost of sequencing is dropping faster than Moore's law.
This results in big data sets that are hard to process by traditional means on a single high-end server. As data sizes keep increasing, processing has to happen in parallel, exploiting the data-locality principle to avoid unnecessary shuffling of data over the network. We are looking into ADAM, a project built on top of Apache Spark, to analyze genomic data sets in a scalable way that exploits data locality. The combination of Spark's machine learning support and ADAM's support for genomic data makes this platform a good choice.
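The data-locality principle boils down to a two-phase pattern: compute a small summary on each partition where the data already lives, and only ship those summaries over the network, never the raw records. A minimal pure-Python sketch of that pattern follows; the function names and the per-base coverage numbers are illustrative, not ADAM's API:

```python
# Two-phase aggregation: each "node" reduces its local partition,
# and only the tiny summaries cross the (simulated) network.

def local_summary(partition):
    """Runs where the data lives: reduce a partition to (count, sum)."""
    return len(partition), sum(partition)

def merge(summaries):
    """Runs on the driver: combine the small per-partition summaries."""
    total_n = sum(n for n, _ in summaries)
    total_s = sum(s for _, s in summaries)
    return total_s / total_n

# E.g. per-base coverage values spread across three nodes.
partitions = [[10, 12, 11], [9, 13], [14, 10, 11, 12]]
summaries = [local_summary(p) for p in partitions]  # 3 tuples, not 9 values
print(merge(summaries))  # mean coverage
```

In Spark this is what `mapPartitions`/`aggregate` do under the hood: nine values stay on their nodes, and only three `(count, sum)` pairs travel, which is the difference that matters when the partitions hold terabytes of reads.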
Recent advances in recording brain activity have led to rapid growth in neural data sets. This excerpt from the paper "Mapping brain activity at scale with cluster computing", published in Nature Methods, describes the challenges and methods for handling this growth:

Understanding brain function requires monitoring and interpreting the activity of large networks of neurons during behavior. Advances in recording technology are greatly increasing the size and complexity of neural data. Analyzing such data will pose a fundamental bottleneck for neuroscience. We present a library of analytical tools called Thunder built on the open-source Apache Spark platform for large-scale distributed computing. We demonstrate how these analyses find structure in large-scale neural data, including whole-brain light-sheet imaging data from fictively behaving larval zebrafish, and two-photon imaging data from behaving mouse. The analyses relate neuronal responses to sensory input and behavior, run in minutes or less and can be used on a private cluster or in the cloud. Our open-source framework thus holds promise for turning brain activity mapping efforts into biological insights.

The researchers were able to benefit from Spark's distributed, fault-tolerant processing capabilities, combined with its support for streaming data, to analyze zebrafish neural activity in real time.
The framework developed for this work has been open sourced as Thunder. We are now looking into testing it together with researchers who need to analyze large-scale neural data sets.
The fields that can benefit from processing large data sets efficiently are not limited to bioinformatics. For example, computational linguistics has data sets from Common Crawl, hosted by Amazon, of 500+ TB; Amazon also hosts satellite data sets of 200+ TB. To process such large data sets efficiently with advanced analytics, e.g. machine learning, we need methods that process data locally and technologies that enable us to do so. Apache Spark is one; in the future there may be more, and we at Uninett & Sigma2 are working with researchers to support them in handling the rise of big data with modern methods.
If you or your research group are affiliated with a university in Norway and would like to test out these or similar big data technologies, do not hesitate to contact us. The service is currently available to all researchers in Norway free of charge.