Evolving Infrastructure for Data Intensive Computing

In an earlier post I talked about Data Analytics as a Service (DaaS); here we will discuss how we can provide such a service with modern data center technologies.

TL;DR

The container movement has taken the data center world by surprise. The main reasons are performance, flexibility and reproducibility. Managing containers is an ongoing challenge, and this article covers the main technologies in this area: resource management, storage, networking and service discovery.

Long Story

To provide any service in the computing world (maybe in the real universe too [1], but that is a story for another time) you need storage, computation, networking and discoverability. The question becomes how we are going to arrange these components. On top of that we have to support multiple users from multiple tenants, so we end up creating a virtualized environment for each tenant. The next question is at which level we create that virtualization. There is an interesting talk by Bryan Cantrill of Joyent discussing virtualization at different levels:

Hardware -> OS -> Application 

It turns out that virtualization at the OS level provides better flexibility as well as better performance.

So if we decide to use the OS as the abstraction level, then KVM, Xen or any other kind of hypervisor is out of the question: they result in multiple operating systems running separately, each one told by the hypervisor that it is alone on the machine, and none of them sharing the underlying OS. The picture [2] below shows the different types of hypervisors.

Hypervisor types

The benefit of running multiple operating systems is clear isolation among tenants and security. But you suffer in performance, as you cannot utilize unused resources from the underlying host, and not having a single OS restricts scheduling optimizations for CPU and memory. Joyent has done a comparison [3] of running containers on their OS-level virtualized cloud service and on hypervisor-based Amazon AWS. There was no surprise in the results: OS-level virtualization wins across the board. The main issue with current Linux containers is security. Work has been done in Docker, Rocket and the kernel itself to harden containers and drop unnecessary capabilities by default.

Now that we agree that OS-level virtualization is the way to go, the question becomes how we arrange the components to provide an elastic, scalable, fault-tolerant service with a high system-to-administrator ratio (there is a nice article about such principles [4]). It is becoming clear that unless you work at Google/Twitter/Facebook/Microsoft, your private cluster will not have the capacity to handle your needs all the time. So you will need technologies that allow you to combine resources from public clouds into a hybrid cloud. I work at the Norwegian NREN, and to provide the DaaS service we need to be flexible and able to combine resources from other providers.

Resource Management

This article considers four technologies: OpenStack, SmartDataCenter, Mesos and Kubernetes. They all manage data center resources and provide easier access to them for users. Some of them (all except OpenStack) make the whole data center appear more like a single large system. I will compare them on the following metrics:

Elasticity, Scalability, Fault tolerance, Multi tenancy, Hybrid cloud, System-to-administrator (S/A) ratio

 

OpenStack provides resources to users in terms of virtual machines, using a hypervisor such as KVM underneath. (Strictly speaking OpenStack does not belong in this comparison because of its hypervisor use, but recent efforts to add Docker support convinced me to include it.)

OpenStack
Elasticity: OK for scaling VMs up and down, but not as quick as Mesos or Kubernetes.
Scalability: Fine for hundreds of machines, but larger installations become hard to manage and scale.
Fault tolerance: A highly available OpenStack setup is possible but takes considerable effort, and it only covers OpenStack itself; VMs cannot be relocated automatically to other compute nodes, which is a big drawback in a large data center.
Hybrid cloud: Not supported, as the abstraction level is not right.
Multi tenancy: Good support.
S/A ratio: Low; OpenStack needs a comparatively large operations team and is notoriously hard to upgrade.

 

Mesos is a resource manager for your whole cluster; its motto is to be the kernel of your data center. It exposes the whole data center as one single computer and provides access to resources in terms of CPU, RAM and disk. Users specify how much of each resource they need rather than how many VMs, because in the end a user cares about the resources to run an application, not about VMs. It is used extensively at companies such as Twitter and Airbnb. Applications run through frameworks, e.g. Apache Spark or Marathon, which talk to the Mesos master to ask for resources.

Mesos
Elasticity: Easy to scale your applications, as you can simply ask for more resources.
Scalability: Known to run on 10,000+ machines at Twitter.
Fault tolerance: Uses ZooKeeper to keep a quorum of masters for high availability, and it can reschedule user applications automatically when nodes fail.
Hybrid cloud: Possible to combine resources from multiple clouds, since framework instances can talk to different Mesos masters in private and public clouds.
Multi tenancy: Not well supported yet; it is possible, but the isolation and security of the underlying containers is still in question.
S/A ratio: High; 2-3 people can manage 10,000+ machines.
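
To make the resources-not-VMs model concrete, here is a minimal sketch of submitting an application to Marathon, one of the long-running-service frameworks on top of Mesos, through its REST API. The Marathon URL and the application definition are made up for illustration.

```python
# Minimal sketch: ask Marathon (a framework on top of Mesos) to run an
# application with a given amount of CPU and RAM, not a number of VMs.
# The Marathon URL and app definition below are illustrative assumptions.
import requests

MARATHON = "http://marathon.example.org:8080"  # hypothetical endpoint

app = {
    "id": "/daas/notebook",            # hypothetical application id
    "cmd": "python -m http.server 8000",
    "cpus": 0.5,                       # half a CPU core per instance
    "mem": 256,                        # MB of RAM per instance
    "instances": 2,                    # Marathon restarts these if a node dies
}

resp = requests.post(f"{MARATHON}/v2/apps", json=app)
resp.raise_for_status()
print("submitted:", resp.json()["id"])

# Scaling out is just asking for more instances of the same resources.
requests.put(f"{MARATHON}/v2/apps/daas/notebook", json={"instances": 4})
```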

Kubernetes is another project for managing a cluster, running containers grouped as pods. It follows the same principle as Mesos of providing resources rather than VMs. It is a new project, based on the experience from Google's Borg [5]. The main difference between Kubernetes and Mesos is that in Kubernetes everything is a container, whereas in Mesos you have frameworks, e.g. Apache Spark or Marathon, which use containers underneath, so Mesos adds one more layer of abstraction. Check out the interesting talk from Tim Hockin on Kubernetes.

Kubernetes
Elasticity: Easy to scale your applications; you can ask for more resources for your pods or create more pods with a replication controller.
Scalability: A new project, but if it resembles Borg it is going to support thousands of machines.
Fault tolerance: Kubernetes itself is not yet highly available, although work is going on to add that; it does reschedule user applications automatically when nodes fail.
Hybrid cloud: Possible to combine resources from multiple clouds; supporting private, public and hybrid clouds is a stated goal on the project's front page.
Multi tenancy: Not well supported yet, but judging from the roadmap this will change.
S/A ratio: High; a handful of people can manage a large number of machines.
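
To illustrate pods and replication controllers, here is a minimal sketch that asks the Kubernetes API server to keep two copies of a pod running; the API server address, image and resource limits are assumptions for illustration.

```python
# Minimal sketch: create a replication controller that keeps two copies of a
# pod running; Kubernetes reschedules the pods if a node fails.
# The API server address, image and limits are illustrative assumptions.
import requests

APISERVER = "http://kube-master.example.org:8080"  # hypothetical, no auth

rc = {
    "apiVersion": "v1",
    "kind": "ReplicationController",
    "metadata": {"name": "daas-web"},
    "spec": {
        "replicas": 2,                      # desired number of pod copies
        "selector": {"app": "daas-web"},
        "template": {
            "metadata": {"labels": {"app": "daas-web"}},
            "spec": {
                "containers": [{
                    "name": "web",
                    "image": "nginx",       # any container image
                    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
                }],
            },
        },
    },
}

resp = requests.post(
    f"{APISERVER}/api/v1/namespaces/default/replicationcontrollers", json=rc)
resp.raise_for_status()
print("created:", resp.json()["metadata"]["name"])
```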

SmartDataCenter (SDC) is from Joyent and based on SmartOS. It provides a container-based service, "Triton", in public and private clouds using Zones, which are production ready. The main benefit is that it uses a technology that is battle tested; the downside is that it is based on the OpenSolaris-derived (illumos) kernel, so if you need kernel modules that are available only for Linux you are out of luck. Running Linux userspace applications is fine, though, as SDC has LX-branded Zones for that purpose.

SmartDataCenter
Elasticity: Easy to scale applications, with claimed support for online resizing of resources.
Scalability: Known to run on large clusters (1,000+ machines) powering a public cloud service.
Fault tolerance: The SDC architecture uses a head node from which all machines in the cluster boot. Since SmartOS runs entirely from RAM, the head node can be a single point of failure after a power failure.
Hybrid cloud: Not clear from their documentation how well a private deployment can be combined with their public cloud.
Multi tenancy: Good support, as Zones are well isolated and secure enough to host multiple tenants.
S/A ratio: High; a handful of people can manage a large number of machines.

Storage

For storage we need a scale-out storage system that enables processing of data where it is stored; once data becomes "big data", moving it around is not really a solution. The principle for storing large data sets is to divide them into smaller blocks and store those on multiple servers for performance and reliability. Using this principle you can offer an object storage service, e.g. Amazon S3, or add layers on top of the underlying objects to provide block or file storage as well, e.g. HDFS for a file system.
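
The divide-and-replicate principle itself is easy to sketch: split a file into fixed-size blocks and place each block on several servers. The block size, replication factor and server names below are made-up values, not a real system.

```python
# Toy sketch of the scale-out storage principle: split a file into fixed-size
# blocks and place each block on several servers for performance and
# reliability. Block size, replication factor and servers are made-up values.
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024        # 64 MB per block, similar to HDFS defaults
REPLICAS = 3                         # copies kept of every block
SERVERS = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical hosts


def place_blocks(path):
    """Return (block_id, offset, [servers]) placements for one file."""
    placements = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            block_id = hashlib.sha1(block).hexdigest()
            # Spread replicas over distinct servers, chosen by block hash.
            start = int(block_id, 16) % len(SERVERS)
            targets = [SERVERS[(start + i) % len(SERVERS)] for i in range(REPLICAS)]
            placements.append((block_id, offset, targets))
            offset += len(block)
    return placements
```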

In open source, Ceph is currently one of the main viable candidates for scale-out storage. Ceph provides object, block and file storage, which satisfies the criteria mentioned above. The maturity varies by storage type: object and block are stable (Yahoo runs its object store on Ceph), whereas the file system is not; it may take another year or so before the file system is stable enough to be useful. Mesos and Kubernetes are both trying to add support for a persistence layer in their resource management, which will make it possible to run the storage components themselves under a normal resource manager and lower the effort of administering storage.
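
One practical consequence of Ceph exposing its object storage through the S3-compatible RADOS Gateway is that ordinary S3 tooling works against it. A minimal sketch with boto3, assuming a hypothetical gateway endpoint, credentials and bucket:

```python
# Minimal sketch: talk to Ceph object storage through its S3-compatible
# RADOS Gateway using boto3. Endpoint, credentials and bucket name are
# illustrative assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.org:7480",   # hypothetical RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="genomes")
s3.put_object(Bucket="genomes", Key="sample-001.fastq", Body=b"...raw reads...")
obj = s3.get_object(Bucket="genomes", Key="sample-001.fastq")
print(obj["Body"].read()[:20])
```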

Another option is SDC's object storage component, Manta. Manta goes beyond Amazon S3 in that you can run computation directly on the data where it is stored. Manta embraces data locality to handle big data, but it does not have Hadoop bindings (embrace Hadoop, guys, the way you embraced Linux with LX-branded zones). With Hadoop bindings the whole Hadoop ecosystem would be open to Manta as well, and you do need support for distributed machine learning algorithms to go beyond basic word counting. Of course we could run HDFS on top of the ZFS that SDC uses, but then we would have to duplicate our big data. Another issue with SDC is the lack of a POSIX-like distributed file system: it has an NFS interface to Manta, but that NFS implementation does not support locking, so it is hard to mount it in multiple containers and run MPI applications or other tools that need a distributed file system. We could run Ceph on ZFS, but CephFS needs a kernel module that is not yet available on SmartOS.

Providing distributed, scalable storage is not an easy challenge; it is all about trade-offs (remember the CAP theorem). There is a lot of work going on in this area, but no clear winner yet.

Networking

To build a scale-out architecture we need to connect these containers to each other. The default model of Docker is that each container gets an IP address from a bridge, docker0. Since the bridge is local to each machine, containers running on different machines can end up with the same IP address, which makes communication between containers hard.
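
The problem is easy to see from the Docker API: the address a container receives comes from the host-local docker0 bridge (typically 172.17.0.0/16), so two containers on different hosts can end up with the same address. A small sketch using the Docker Python SDK against a local daemon:

```python
# Minimal sketch: each container's IP comes from the host-local docker0
# bridge, so two different hosts can hand out the same address.
# Assumes a local Docker daemon and the docker Python SDK.
import docker

client = docker.from_env()
container = client.containers.run("alpine", "sleep 300", detach=True)
container.reload()  # refresh attrs so the network settings are populated

ip = container.attrs["NetworkSettings"]["IPAddress"]
print("bridge-local IP:", ip)  # only meaningful on this host

container.remove(force=True)
```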

To solve this we need to give each container a unique IP address. There are mainly two design options:

  • Overlay network: create a VLAN overlay on the physical network and give each container a unique IP address. Several projects have chosen this path: Open vSwitch, Weave and Flannel all use this method. Weave, for example, makes it easy to combine containers running at different cloud providers by creating an overlay between them, and it even supports multicast within the overlay. The main concern is the performance hit of the extra encapsulation.
  • No overlay: a project named Calico is trying to solve container networking with BGP and Linux ACLs. It does not create an overlay network, but simply uses L3 routing and ACLs to isolate traffic between containers.

With IPv6 support in Docker and other container runtimes (Rocket from CoreOS, Zones in SDC), every container will eventually have its own IP address; the question that remains is which method to use. There is not much documentation yet on the performance of either approach, and at the same time both are maturing quickly. Software Defined Networking (SDN) is also worth watching, to see how it can be combined with the projects above to configure the network on demand.

Service Discovery

You have a cluster with containers dynamically assigned to nodes. Now how are you going to find service X? You need to know on which server and port service X is running, and you also need to configure your load balancer with the IP address of every new service X instance.

Enter the world of service discovery for the modern data center. Most of the projects in this area support DNS for discovery, and a few also offer an HTTP-based key/value interface.

  • Single data center: Mesos-DNS is a DNS-based service discovery system to be used together with Mesos. SkyDNS is a service discovery mechanism backed by etcd.
  • Multi data center: Consul is a service discovery tool designed to work across data centers. Like etcd it is built on the Raft consensus protocol, and it also supports service health checks.

There may be more projects doing service discovery, as the container area is hot and evolving fast, but Consul looks like a very good candidate for realizing service discovery in a hybrid cloud infrastructure.
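
As a rough illustration of how this looks with Consul, the sketch below registers an instance of a hypothetical service with a health check on the local agent and then asks where the service is running; the service name, port and health-check URL are assumptions.

```python
# Minimal sketch: register a service instance with the local Consul agent,
# including an HTTP health check, then discover where the service runs.
# Service name, port and health-check URL are illustrative assumptions.
import requests

CONSUL = "http://127.0.0.1:8500"   # local Consul agent

# Register "service X" with a health check so Consul stops advertising
# unhealthy instances.
requests.put(f"{CONSUL}/v1/agent/service/register", json={
    "Name": "service-x",
    "Port": 8080,
    "Check": {"HTTP": "http://127.0.0.1:8080/health", "Interval": "10s"},
})

# Discovery over HTTP: which nodes and ports run service-x right now?
for inst in requests.get(f"{CONSUL}/v1/catalog/service/service-x").json():
    print(inst["Address"], inst["ServicePort"])

# The same answer is available over DNS as service-x.service.consul,
# which is what a load balancer or client library would typically use.
```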

Summary

To handle the coming years' challenges with big data from genomes, sensors and video, we need a scale-out infrastructure with fault tolerance and automatic healing. The technologies running on such an infrastructure need to support data locality, because moving the data around is not a solution.

There is no clear winner when it comes to resource managers. OpenStack is bogged down by legacy thinking (VMs), has no support for hybrid clouds and offers a low systems-to-administrator ratio. Mesos embraces modern thinking in terms of resources and applications rather than VMs, but it started before the container movement went bananas, so it carries another layer of abstraction called frameworks; although frameworks use cgroups underneath for isolation, the container movement is not only about isolation. Kubernetes might get the container movement right, as it builds on the lessons from Google's Borg (running containers in production for 5+ years). Joyent's SDC has been running containers for 8+ years and is production ready when it comes to resource management and networking, but its storage layer still leaves you feeling that some crucial parts are missing. For us, being an NREN and supercomputing organization, a distributed file system is a must: researchers will run a variety of workloads and will need a scale-out storage layer to store and process data with data locality support.

I have not covered the new container OSes (CoreOS, RancherOS, Atomic, Photon) for running Kubernetes/Mesos in this post, nor PaaS solutions (Deis, OpenShift, Marathon, Aurora).

Thanks to my colleagues Morten Knutsen and Olav Kvittem for interesting discussions and feedback on this article.

References:

[1]: Programming the Universe: A Quantum Computer Scientist Takes on the Cosmos

[2]: http://en.wikipedia.org/wiki/Hypervisor

[3]: https://www.joyent.com/blog/docker-bake-off-aws-vs-joyent

[4]: On Designing and Deploying Internet-Scale Services

[5]: Large-scale cluster management at Google with Borg
