Data Analysis as a Service

Genomics, Proteomics, Neuroinformatics... *ics and Big Data

In a previous post I described our Data Analysis as a Service and its motivation and goals. In this post, I will describe what we have been doing in this area and our plans ahead.

We have been working with big data for more than a year now at Uninett. We have a cluster of 18 physical machines running Apache Mesos, a resource manager, with Apache Spark, a distributed fault-tolerant processing system with support for machine learning algorithms, on top of it to process data in a distributed way. In 2014, we started prototyping with Apache Spark to process NetFlow data from our network routers and performed some basic analysis. The results were really surprising in terms of performance and ease of use. This motivated us to look around for other areas that could benefit from scalable distributed processing and easy-to-use machine learning on big data.

Proteomics

We got in touch with the Proteomics and Metabolomics Core Facility (PROMEC) here in Trondheim. They have a substantial amount of protein data sets and are interested in using Apache Spark to analyze them with machine learning in a scalable way. This resulted in two different projects between PROMEC and Uninett to analyze protein data sets.

  • The main challenge in using Spark to analyze the protein data sets was that the tools used by the community are geared towards monolithic systems and are not suitable for distributed processing. Therefore we started with Spark Hydra to add support for proteomics algorithms, e.g. X!Tandem and Comet, in Spark. This enables us to scale proteomics analysis without worrying about the size of the data sets. We are currently working on improving its accuracy, and results will be published later.
  • The other project uses Spark's machine learning library to cluster proteomics MGF data produced by a high-throughput sequencer. First we calculated the Within Set Sum of Squared Errors (WSSSE) for a range of cluster counts (see the sketch after this list).
    [Figure: WSSSE value on the Y-axis with increasing cluster count on the X-axis]

    This helped in finding the optimal number of clusters in our data sets. We then trained our model with that cluster count. Once trained, we tested it on the full data sets to see whether any cluster formed outside our normal M/Z ratio range, which helped us check for issues with our sequencer.

    [Figure: Clusters in the MGF data sets, with M/Z ratio on the X-axis and intensity on the Y-axis]

    As you can see, a small cluster forms at approximately 680 M/Z, which may be due to issues in our sequencer. We now plan to do this in real time to help us detect problems and quality-check the readings coming from the sequencer.
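For completeness, here is a minimal sketch of how such an elbow-method run can look with Spark MLlib's KMeans, assuming the MGF spectra have already been parsed into comma-separated (m/z, intensity) rows; the input path, feature layout and cluster range are illustrative assumptions rather than our actual pipeline.

    # Sketch: compute WSSSE for a range of cluster counts with MLlib KMeans.
    # The input path and (m/z, intensity) feature layout are assumptions.
    from math import sqrt
    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="mgf-kmeans-wssse")

    # Each line: "mz,intensity" -> numeric feature vector
    peaks = sc.textFile("hdfs:///data/mgf_peaks.csv") \
              .map(lambda line: array([float(x) for x in line.split(",")]))
    peaks.cache()

    def wssse(model, data):
        """Within Set Sum of Squared Errors for a fitted model."""
        def error(point):
            center = model.centers[model.predict(point)]
            return sqrt(sum((point - center) ** 2))
        return data.map(error).reduce(lambda a, b: a + b)

    # Train with an increasing cluster count and record the cost; the
    # "elbow" where the curve flattens suggests the optimal k.
    for k in range(2, 21):
        model = KMeans.train(peaks, k, maxIterations=20)
        print(k, wssse(model, peaks))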

Genomics

The rise of high-throughput sequencers and the decreasing cost of sequencing a genome have resulted in rapidly growing genomic data sets. The sequencing cost is falling faster than Moore's law due to advances in sequencing technology.

[Figure: Cost per genome over time. Credit: NIH/Wikipedia]

This results in big data sets which are hard to process using the traditional means of a single high-end server. As data sizes keep increasing, we need to process data in parallel and use the data locality principle to avoid unnecessary data shuffling over the network. We are looking into a project called ADAM, built on top of Apache Spark, which processes genomic data sets in a scalable way while exploiting data locality. The combination of Spark's machine learning support and ADAM's support for genomic data makes this platform a good choice.

Neuroinformatics

Recent advances in recording brain activity have resulted in a rapid growth of neural data sets. This is an excerpt from the paper “Mapping brain activity at scale with cluster computing”, published in Nature Methods, describing the challenges and methods for addressing this growth in neural data sets:

Understanding brain function requires monitoring and interpreting the activity of large networks of neurons during behavior. Advances in recording technology are greatly increasing the size and complexity of neural data. Analyzing such data will pose a fundamental bottleneck for neuroscience. We present a library of analytical tools called Thunder built on the open-source Apache Spark platform for large-scale distributed computing. We demonstrate how these analyses find structure in large-scale neural data, including whole-brain light-sheet imaging data from fictively behaving larval zebrafish, and two-photon imaging data from behaving mouse. The analyses relate neuronal responses to sensory input and behavior, run in minutes or less and can be used on a private cluster or in the cloud. Our open-source framework thus holds promise for turning brain activity mapping efforts into biological insights.

The researchers benefited from Spark's distributed, fault-tolerant processing capabilities, combined with its support for streaming data, to analyze zebrafish neural activity in real time.
[Figure: Neural data processing pipeline. Reference: Thunder talk]

The framework used for this work, called Thunder, is open source. We are now looking into testing it out together with researchers who need to analyze large-scale neural data sets.

Summary

The fields that can benefit from processing large data sets efficiently are not limited to bioinformatics. For example, computational linguistics has data sets available from Common Crawl on Amazon of 500+ TB, and satellite data sets on Amazon of 200+ TB. To process such large data sets efficiently with advanced analytics, e.g. machine learning, we need methods that process data locally and technologies that enable us to do so. Apache Spark is one, but in the future there may be more, and we at Uninett & Sigma2 are working with researchers to support them in handling the rise of big data with modern methods.

If you or your research group are affiliated with a university in Norway and would like to test out these or similar big data technologies, do not hesitate to contact us. The service is currently available to all researchers in Norway free of charge.

 

Evolving Infrastructure for Data Intensive Computing

In an earlier post I talked about the Data Analysis as a Service (DaaS) offering; here we will discuss how we can provide such a service with modern data center technologies.

TL;DR

The container movement has taken the whole data center world by surprise. The main reasons for this have been the performance benefits, flexibility and reproducibility. Managing containers is an ongoing challenge, and this article tries to cover the main technologies in this area, ranging from resource management and storage to networking and service discovery.

Long Story

To provide any service in the computing world (maybe in the real universe too [1], a story for another time) you need storage, computation, networking and discoverability. The question becomes how we are going to arrange these components. On top of that we have to support multiple users from multiple tenants, so we end up creating a virtualized environment for each tenant. The question then becomes at which level we create the virtualization. There is an interesting talk from Bryan Cantrill of Joyent discussing virtualization at different levels:

Hardware -> OS -> Application 

It turns out that virtualization at the OS level results in better flexibility as well as better performance.

So if we decide to use the OS as the abstraction level, then KVM, Xen or any other kind of hypervisor is out of the question, as they result in multiple OSes running separately, not sharing an underlying OS, each being told by the hypervisor that it is alone on the hardware. The picture [2] below shows the different types of hypervisors.

Hypervisor types

The benefit of using multiple OSes is clear isolation among tenants, and security. But you suffer in terms of performance, as you cannot utilize unused resources from the underlying host, and not having a single OS restricts scheduling optimizations for CPU and memory. Joyent has done a comparison [3] of running containers on their OS-level virtualized cloud service versus hypervisor-based Amazon AWS. There was no surprise in the results: OS-level virtualization wins by all measures. The main issue with current Linux containers is security. Work has been done in Docker, Rocket and the kernel itself to harden them and to drop unnecessary capabilities from containers by default.

Now that we agree OS-level virtualization is the way to go, the question becomes how we arrange the components to provide an elastic, scalable, fault-tolerant service with low administrative overhead, i.e. a high system-to-administrator ratio (there is a nice article about such principles [4]). It is becoming clear that unless you work at Google/Twitter/Facebook/Microsoft, your private cluster will not have the capacity to handle your needs all the time. So you will need technologies that allow you to combine resources from public clouds into a hybrid cloud. I work at the Norwegian NREN, and to provide the DaaS service we need to be flexible and able to combine resources from other providers.

Resource Management

Four technologies are considered in this article: OpenStack, SmartDataCenter, Mesos and Kubernetes. These technologies manage data center resources and provide easier access to users. Some of them (all except OpenStack) make the whole data center appear more like a single large system. I will compare them on the following metrics:

Elasticity, Scalability, Fault tolerance, Multi tenancy, Hybrid cloud, System-to-administrator (S/A) ratio

 

OpenStack provides resources to users in terms of virtual machines, using a hypervisor such as KVM underneath. (Strictly speaking OpenStack should not be in this comparison because of its hypervisor use, but recent efforts to add Docker support made me include it in this modern data center comparison.)

OpenStack
Elasticity: OK for scaling VMs up and down, but not as quick as Mesos or Kubernetes.
Scalability: Fine for hundreds of machines, but beyond that it becomes harder to manage and scale.
Fault Tolerance: An HA OpenStack setup is possible but requires considerable effort, and it only covers OpenStack itself, not the VMs, which cannot be relocated automatically to other compute nodes. That is a big drawback in a large data center.
Hybrid Cloud: Not supported, as the abstraction level is not right.
Multi Tenancy: Good support.
S/A Ratio: Low; it requires a lot of administration and is notorious for being hard to upgrade.

 

Mesos is a resource manager for your whole cluster. Its motto is to be “the kernel of your data center”. It exposes the whole data center as one single computer and provides access to resources in terms of CPU, RAM and disk. It thus asks users to specify how many resources they need rather than how many VMs, since in the end users care about the resources to run their applications, not VMs. It is used extensively in many companies, e.g. Twitter and Airbnb. It uses frameworks, e.g. Apache Spark and Marathon, which talk to the Mesos master to ask for resources (see the sketch after the table below).

Mesos
Elasticity: Easy to scale your applications, as you can simply ask for more resources.
Scalability: Known to run on 10,000+ machines at Twitter.
Fault Tolerance: It uses ZooKeeper to keep a quorum of masters for HA, and it can reschedule user applications automatically when nodes fail.
Hybrid Cloud: It is possible to combine resources from multiple clouds, since you can have multiple framework instances talking to different Mesos masters in private and public clouds.
Multi Tenancy: Not well supported yet. It is possible, but the isolation and security of the underlying containers are still in question.
S/A Ratio: High, as 2-3 people can manage 10,000+ machines.
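To make the resource-centric model concrete, here is a minimal sketch of submitting an application to Mesos via the Marathon framework's REST API, asking for CPU, memory and a number of instances rather than VMs. The Marathon endpoint, application id and resource figures are illustrative assumptions.

    # Sketch: ask Mesos (through the Marathon framework) for resources.
    # The Marathon endpoint, app id and figures are illustrative assumptions.
    import requests  # pip install requests

    app = {
        "id": "/demo/web",
        "cmd": "python -m SimpleHTTPServer $PORT0",
        "cpus": 0.5,       # CPU share per instance
        "mem": 256,        # MB of memory per instance
        "instances": 3,    # Marathon keeps three copies running
    }

    resp = requests.post("http://marathon.example.org:8080/v2/apps", json=app)
    resp.raise_for_status()
    print("submitted", resp.json()["id"])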

Kubernetes is another project to manage a cluster running containers, grouped as pods. It follows similar principles to Mesos in providing resources rather than VMs. It is a new project based on the experience from Google's Borg [5]. The main difference between Kubernetes and Mesos is that in Kubernetes everything is a container, whereas in Mesos you have frameworks, e.g. Apache Spark and Marathon, which use containers underneath. So Mesos adds one more layer of abstraction. Check out the interesting talk from Tim Hockin on Kubernetes (a minimal pod sketch follows the table below).

Kubernetes
Elasticity: Easy to scale your applications; you can simply ask for more resources for your pods or create more pods with a replication controller.
Scalability: It is a new project, but if it resembles Borg it is going to support thousands of machines.
Fault Tolerance: Kubernetes is currently not an HA system, but work is ongoing to add support for it. It can reschedule user applications automatically when nodes fail.
Hybrid Cloud: It is possible to combine resources from multiple clouds; supporting private, public and hybrid clouds is stated as a goal on its front page.
Multi Tenancy: Not well supported yet, though the roadmap suggests this will change.
S/A Ratio: High, as 2-3 people can manage a large number of machines.
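For comparison, here is a minimal sketch of a Kubernetes pod manifest built as a plain Python dictionary; the pod name, image and resource figures are illustrative assumptions. The resulting file could then be submitted with kubectl create -f demo-pod.json.

    # Sketch: a minimal Kubernetes pod manifest; names and figures are
    # illustrative assumptions, not a recommended configuration.
    import json

    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "demo-web", "labels": {"app": "demo"}},
        "spec": {
            "containers": [{
                "name": "web",
                "image": "nginx",
                "ports": [{"containerPort": 80}],
                # Again: ask for resources, not for a VM
                "resources": {"requests": {"cpu": "500m", "memory": "256Mi"}},
            }]
        },
    }

    # Write the manifest so it can be handed to kubectl
    with open("demo-pod.json", "w") as f:
        json.dump(pod, f, indent=2)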

SmartDataCenter (SDC) is from Joyent and based on SmartOS. It provides a container-based service, “Triton”, in public/private clusters using Zones, which are production ready. The main benefit is that it uses technology that is battle tested and ready for production. The downside is that it is based on the OpenSolaris kernel, so if you need kernel modules that are only available for Linux, you are out of luck. But running Linux userspace applications is fine, as SDC has LX-branded Zones to run Linux applications.

SmartDataCenter
Elasticity: Easy to scale applications; it claims online resizing of resources.
Scalability: Known to run on large clusters (1,000+) and powers a public cloud service.
Fault Tolerance: The SDC architecture uses a head node from which all machines in the cluster boot. As SmartOS runs entirely from RAM, the head node can be a single point of failure in case of a power failure.
Hybrid Cloud: It is not yet clear from the documentation how well combining a private installation with their public cloud is supported.
Multi Tenancy: Good support, as Zones are well isolated and secure enough to host multiple tenants.
S/A Ratio: High, as 2-3 people can manage a large number of machines.

Storage

For storage we need a scale-out storage system which enables processing of data where it is stored. As the data becomes “big data”, moving it around is not really a solution. The principle for storing large data sets is to divide them into smaller blocks and store those on multiple servers for performance and reliability. Using this principle, you can have an object storage service, e.g. Amazon S3, or you can add layers on top to provide block or file system interfaces over the underlying objects, e.g. HDFS for a file system.

In open source, Ceph is currently one of the main viable candidates for scale-out storage. Ceph provides object, block and file system interfaces, which satisfies the criteria above. The maturity of Ceph varies with the type of storage: object and block are stable (see Yahoo's object store), whereas the file system is not. It may take another year or so before the file system is stable enough to be useful. Mesos and Kubernetes are both trying to add support for a persistence layer in their resource management. This will make it possible to run the storage components themselves inside normal resource managers, lowering the administration effort needed to manage storage.

Another option is SDC's object storage component called Manta. Manta goes beyond Amazon S3 in that you can also run computation on the data where it is stored. Manta embraces data locality to handle big data, but it does not have Hadoop bindings (embrace Hadoop, guys, similarly to what you did with LX-branded zones for Linux). With Hadoop bindings, the world of Hadoop would be open to Manta as well, because you do need support for distributed machine learning algorithms to go beyond basic word counting. Of course we could run HDFS on top of the ZFS that SDC uses, but then we would have to duplicate our big data. Another issue with SDC is the lack of a POSIX-like distributed file system; it has an NFS interface to Manta, but that NFS implementation does not have locking support, so it is hard to mount it in multiple containers and run MPI applications or other tools which need a distributed file system. We could run Ceph on ZFS, but CephFS needs a kernel module which is not yet available on SmartOS.

Providing distributed, scalable storage is not an easy challenge. It is all about trade-offs; remember the CAP theorem. There is a lot of work going on in this area, but there is no clear winner yet.

Networking

To build a scale-out architecture, we need to connect these containers to each other. Docker's default model is that each container gets an IP address from a bridge, docker0. As this bridge is local to each machine, containers running on different machines can get the same IP address, which makes communication between containers hard.

To solve this we need to give each container a unique IP address. There are mainly two design options:

  • Overlay network: create a virtual network overlay on top of the physical network and give each container a unique IP address. A few projects have chosen this path: Open vSwitch, Weave and Flannel all use this method. For example, Weave makes it easy to combine containers running at different cloud providers by creating an overlay between them. The main concern is the performance hit of doing so. You can also use multicast in these overlay networks, e.g. Weave multicast.
  • No overlay: a project named Calico is trying to solve container networking using BGP and Linux ACLs. It does not create an overlay network but simply uses L3 routing and ACLs to isolate traffic between containers.

With support for IPv6 in Docker and other container runtimes (Rocket from CoreOS, Zones from SDC), there is going to be an IP address for each container; the question that remains is which method to use. There is not much documentation on the performance of either method, while at the same time the approaches are evolving quickly towards maturity. Software Defined Networking (SDN) is also worth watching, in particular how we can combine SDN with the above-mentioned projects to configure the network on demand.

Service Discovery

You have got your cluster with containers dynamically assigned to nodes. Now, how are you going to find service X? You need to know which server service X is running on and on which port, and you will also need to configure your load balancer with the IP addresses of new service X instances.

Enter the world of service discovery for the modern data center. Most of the projects in this area support DNS for discovery, and a few additionally support an HTTP-based key/value store.

  • Single data center: Mesos-DNS is a DNS-based service discovery mechanism to be used together with Mesos. SkyDNS is a service discovery mechanism based on etcd.
  • Multiple data centers: Consul is a service discovery tool that spans clusters and data centers. It provides a key/value store and DNS interface similar to etcd-based tools, and it also supports service health checks.

There may be more projects doing service discovery, as the container area is very hot and evolving, but Consul looks like a very good candidate for realizing service discovery in a hybrid cloud infrastructure.
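As a small illustration of DNS-based discovery, the sketch below queries a local Consul agent for a service registered under the name "web"; the agent address, DNS port and service name are assumptions for illustration, and the dnspython library is used for the lookup.

    # Sketch: look up a service through Consul's DNS interface.
    # Assumes a Consul agent on 127.0.0.1:8600 and a service named "web".
    import dns.resolver  # pip install dnspython

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["127.0.0.1"]   # Consul's DNS endpoint
    resolver.port = 8600

    # SRV records carry both the node and the dynamically assigned port
    for srv in resolver.query("web.service.consul", "SRV"):
        print("instance at %s port %d" % (srv.target, srv.port))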

Summary

To handle the challenges of the coming years, with big data from genomes, sensors and video, we need a scale-out infrastructure with fault tolerance and automatic healing. Technologies running on such infrastructure need to support data locality, as moving data around is not a solution.

There is no clear winner when it comes to resource managers. OpenStack is bogged down by legacy thinking (VMs), with no support for hybrid clouds and a low systems-to-administrator ratio. Mesos embraces modern thinking in terms of resources and applications rather than VMs, but it started before the container movement went bananas, so it has another layer of abstraction called frameworks. Although frameworks use cgroups underneath for isolation, the container movement is not only about isolation. Kubernetes might get the container movement right, as it builds on lessons learned from Google's Borg (running containers for 5+ years in production). Joyent's SDC has been around containers for 8+ years and has a production-ready system when it comes to resource management and networking, but the storage layer still leaves you feeling that some crucial parts are missing. For us, as an NREN and supercomputing organization, a distributed file system is a must, as researchers will need to run a variety of workloads and will require a scale-out data storage layer to store and process data with data locality support.

I have not covered the new container OSes (CoreOS, RancherOS, Atomic, Photon) for running Kubernetes/Mesos in this post, nor PaaS solutions (Deis, OpenShift, Marathon, Aurora).

Thanks to my colleagues Morten Knutsen & Olav Kvittem for interesting discussion and feedback on the article.

References:

[1]: Programming the Universe: A Quantum Computer Scientist Takes on the Cosmos

[2]: http://en.wikipedia.org/wiki/Hypervisor

[3]: https://www.joyent.com/blog/docker-bake-off-aws-vs-joyent

[4]: On Designing and Deploying Internet-Scale Services

[5]: Large-scale cluster management at Google with Borg

Data Analysis as a Service

The goals for Data Analysis as a Service (DaaS) are:

  • A service to enable processing of large data sets, so-called big data, in parallel using the data locality principle.
  • The ability to share research data as well as processing pipelines with the corresponding research communities.

Imagine you are reading a scientific paper and think of some modifications you would like to test against the published analysis. It would be great if the author provided a link to a portal where you could rerun the whole analysis with your modifications on a subset or the full data sets. Research would then progress at a faster pace, results would be reproducible, and collaboration between researchers would become much easier.

Currently, researchers focus on sharing data with other researchers for collaboration or publications, as this enables others to reproduce their results and use the data for future research. One challenge is that preparing data for analysis is a big part of the analysis and poses a significant challenge in itself. DaaS will enable researchers to share not only their data sets but also the whole processing pipeline, allowing them to collaborate with a speed and ease that is not possible today.

In recent years, the amount of data generated has been increasing exponentially. The data comes from different sources such as gene sequencing, protein sequencing, neuroscience, sensor networks, network flows, social media, machine logs, satellite images, etc. Researchers would like to analyze these data sets with advanced algorithms, e.g. machine learning, without worrying about their scale. The current method of using one or a few high-end machines, or traditional supercomputing, has problems handling big data. Simply moving 100 TB of data over a 10 Gbps link takes over 22 hours at full bandwidth (see the back-of-the-envelope calculation below). So we need a service where we can process data where it is stored, enabling an efficient and scalable way of processing large data sets.
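A back-of-the-envelope calculation, assuming decimal terabytes and no protocol overhead:

    # Transfer time for 100 TB over a 10 Gbps link at full bandwidth.
    data_bits = 100e12 * 8          # 100 TB expressed in bits
    link_bps = 10e9                 # 10 Gbps
    hours = data_bits / link_bps / 3600
    print("%.1f hours" % hours)     # ~22.2 hours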

Of course you can use the Amazon or Google cloud for this. But problems arise with data leaving the country, jurisdiction/control (if you think this is not an issue, I recommend reading [1]) and how the cost will be covered. Moreover, plain Amazon/Google clouds are not so easy to use either; a large number of startups are built around making them user friendly. Being part of the NREN [2] and national supercomputing in Norway, we are working on providing such a service to our researchers. We are also collaborating with our Nordic partners in the Glenna project [3]. This will enable us to share resources among the Nordic countries and build knowledge together.

References:

[1] Who Owns the Future?

[2] National Research and Education Network

[3] Glenna NeIC Project

Hadoop YARN (2.2.0) setup with Ceph

In this post we will make YARN use Ceph rather than HDFS as its file system. We assume that you already have a YARN cluster set up; if not, you can follow the nice guide from Cloudera (you don't need to install the HDFS components). We also assume that you have a Ceph cluster up and running; if not, you can follow the how-to guide in the Ceph docs. You should install the YARN NodeManagers on each host in your cluster where Ceph OSDs are running. Once you have the YARN and Ceph clusters up and running, follow these steps to make YARN start using Ceph.

  • Install the following packages on each Ceph/NodeManager node:
  • apt-get install libcephfs-java libcephfs-jni
  • Once installed on each host, copy the Ceph Java libraries to the Hadoop lib folders:
  • cp /usr/share/java/libcephfs-0.72.2.jar /usr/lib/hadoop/lib/
    cp /usr/lib/jni/libcephfs_jni.so* /usr/lib/hadoop/lib/native
  • You will also need the cephfs-hadoop plugin jar, built by compiling this git repo including the patch which adds support for Hadoop 2. You can download the compiled jar from here; it works with CDH5 beta 2.
  • Once you have built/downloaded the cephfs-hadoop plugin jar, copy it to the Hadoop lib folder:
  • cp ~/cephfs-hadoop-1.0-SNAPSHOT.jar /usr/lib/hadoop/lib/
  • Now we have all the libraries needed for Hadoop to run with Ceph. It is time to set up the configuration in core-site.xml to make Hadoop use Ceph. Make sure the admin.secret file exists at the path you reference in the config options. Add the following options to the config file:
  • <property>
      <name>fs.defaultFS</name>
      <value>ceph://mon-host:6789/</value>
    </property>

    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>

    <property>
      <name>ceph.auth.id</name>
      <value>admin</value>
    </property>

    <property>
      <name>ceph.auth.keyfile</name>
      <value>/etc/hadoop/conf/admin.secret</value>
    </property>

    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>

    <property>
      <name>fs.AbstractFileSystem.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephHadoop2FileSystem</value>
    </property>

    <!-- Default object size in bytes (here 64 MB), playing a role similar to the HDFS block size -->
    <property>
      <name>ceph.object.size</name>
      <value>67108864</value>
    </property>
  • You do need to create the YARN and MapReduce history directories with the correct permissions, similar to what you would have done for HDFS. Now you can run the following command to see the contents of the Ceph “/” path:
  • # hdfs dfs -ls /
    Found 3 items
    drw-r--r-- - yarn 209715222774 2014-04-04 15:24 /benchmarks
    drwx------ - yarn 40390852613 2014-04-04 13:59 /user
    drwxrwxrwt - yarn 20298878 2014-04-04 13:59 /var
  • Now you have Hadoop YARN running with Ceph. Enjoy :-)

Data Analysis as a Service: Project motivation and abstract

In recent years, the amount of data generated has been increasing exponentially. The data comes from different sources such as machine logs, gene sequencing, sensor networks, network flows and social media. Researchers in the education and research sector, in areas such as bioinformatics, computer science, astronomy and environmental science, have huge data sets and would like to analyze them without worrying about their scale. Thus there is an increasing demand for putting this data to work by storing and processing it in a horizontally scalable way. In recent years there has been a rise of commercially backed, distributed open source software from the main global actors, e.g. Google, Yahoo, Twitter and Facebook. This distributed software utilizes commodity hardware to store big data and provides the ability to process it locally, giving good economy of scale.

The Data Analysis as a Service (DaaS) project will investigate the possibility of providing a common infrastructure where researchers can store and process their data using advanced algorithms at scale. In this way, we can contribute to building an ecosystem where researchers can analyze their big data sets and share not just data but also whole processing pipelines. This will provide a great opportunity to collaborate with researchers across different institutions and nations. Moreover, researchers can help each other evolve the ecosystem by adding new functionality, thus improving both research and the DaaS platform.