In an earlier post I talked about Data Analytics as a Service (DaaS); here we will discuss how such a service can be provided with modern data center technologies.
The container movement has taken the whole data center world by surprise. The main reasons for this are performance benefits, flexibility and reproducibility. Managing containers is an ongoing challenge, and this article tries to cover the main technologies in this area, ranging from resource management to storage, networking and service discovery.
To provide any service in the computing world (maybe in the real universe too, but that is a story for another time) you need storage, computation, network and discoverability. The question becomes how we are going to arrange these components. On top of that, we have to support multiple users from multiple tenants, so we end up creating a virtualized environment for each tenant. The question then becomes at which level to virtualize. There is an interesting talk from Bryan Cantrill of Joyent discussing virtualization at different levels:
Hardware -> OS -> Application
It turns out that virtualization at the OS level results in better flexibility as well as better performance.
So if we decide to use the OS as the abstraction level, then KVM, Xen and other hypervisors are out of the question, as they result in multiple OS instances running separately, none sharing an underlying OS and each being lied to by the hypervisor about being alone on the hardware. The picture below shows the different types of hypervisors.
The benefit of using multiple OS instances is clear isolation among tenants, and security. But you suffer in terms of performance: you cannot utilize unused resources from the underlying host, and not having a single OS restricts scheduling optimizations for CPU and memory. Joyent has compared running containers on their cloud service, based on OS-level virtualization, against hypervisor-based instances on Amazon AWS. There was no surprise in the results: OS-level virtualization wins by every measure. The main issue with current Linux containers is security. Work has been done on Docker, Rocket and in the kernel itself to harden containers and to drop unnecessary capabilities from them by default.
Now that we agree OS-level virtualization is the way to go, the question becomes how to arrange the components to provide an elastic, scalable, fault-tolerant service that needs as few administrators as possible relative to the number of systems (there is a nice article about such principles). It is becoming clear that unless you work at Google/Twitter/Facebook/Microsoft, your private cluster will not have the capacity to handle your needs all the time, so you will need technologies that allow you to combine resources from public clouds into a hybrid cloud. I work at the Norwegian NREN, and to provide the DaaS service we need to be flexible and able to combine resources from other providers.
This article considers four technologies: OpenStack, SmartOS Data Center, Mesos and Kubernetes. They all manage data center resources and provide easier access for users, and some of them (all except OpenStack) make the whole data center appear more like a single large system. I will compare them on the following metrics:
Elasticity, Scalability, Fault tolerance, Multi tenancy, Hybrid cloud, System-to-administrator (S/A) ratio
OpenStack provides resources to users in the form of virtual machines, using a hypervisor such as KVM underneath. (Strictly speaking, the hypervisor should exclude OpenStack from this comparison, but recent efforts to add Docker support make me include it in our modern data center comparison.)
| Metric | OpenStack |
| --- | --- |
| Elasticity | OK for scaling VMs up and down, but not as quickly as Mesos or Kubernetes |
| Scalability | OK for hundreds of machines, but larger installations become hard to manage and scale |
| Fault tolerance | Possible to set up an HA OpenStack, but it requires a large effort, and it only covers OpenStack itself: VMs cannot be relocated automatically to other compute nodes, which is a big drawback in a large data center |
| Hybrid cloud | Not supported, as the abstraction level is not right |
| Multi tenancy | Good support |
| S/A ratio | Unfavorable: it requires many administrators and is notoriously hard to upgrade |
Mesos is a resource manager for your whole cluster. Its motto is "the kernel of your data center": it exposes the whole data center as one single computer and provides access to resources in terms of CPU, RAM and disk. It thus asks users to specify how many resources they need rather than how many VMs, since in the end users care about resources to run their applications, not about VMs. It is used extensively at companies such as Twitter and Airbnb. It uses frameworks, e.g. Apache Spark and Marathon, which talk to the Mesos master to ask for resources.
| Metric | Mesos |
| --- | --- |
| Elasticity | Easy to scale your applications, as you can simply ask for more resources |
| Scalability | Known to run on 10,000+ machines at Twitter |
| Fault tolerance | Uses ZooKeeper to keep a quorum of masters for HA, and can reschedule user applications automatically when nodes fail |
| Hybrid cloud | Possible to combine resources from multiple clouds, as you can have multiple framework instances talking to different Mesos masters in private/public clouds |
| Multi tenancy | Not well supported yet; it is possible, but the isolation and security of the underlying containers are still in question |
| S/A ratio | Very favorable: 2-3 people can manage 10,000+ machines |
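To make the resource-oriented model concrete, here is a sketch of a Marathon-style application definition: the user declares CPU, memory and an instance count, never VMs. The field names mirror Marathon's app JSON format, but treat this as an illustrative sketch (the app id, command and image are made up), not a complete spec.

```python
import json

# Sketch of a Marathon-style app definition: the user declares the
# resources each instance needs, not which VMs to create.
app = {
    "id": "/analytics/web",            # hypothetical app id
    "cmd": "python -m http.server 8080",
    "cpus": 0.5,                       # fraction of a CPU core per instance
    "mem": 256,                        # MB of RAM per instance
    "instances": 3,                    # Marathon keeps this many copies running
    "container": {
        "type": "DOCKER",
        "docker": {"image": "python:3"},
    },
}

# This payload would be POSTed to the Marathon REST API; scaling out
# later is just a change to the "instances" number.
payload = json.dumps(app, indent=2)
print(payload)
```

Note how the framework, not the user, decides which machines actually run the three instances.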
Kubernetes is another project to manage a cluster, running containers as pods. It follows similar principles to Mesos in providing resources rather than VMs. It is a new project, based on experiences from Google's Borg. The main difference between Kubernetes and Mesos is that in Kubernetes everything is a container, whereas in Mesos you have frameworks, e.g. Apache Spark and Marathon, which use containers underneath; so Mesos adds one more layer of abstraction. Check out the interesting talk from Tim Hockin on Kubernetes.
| Metric | Kubernetes |
| --- | --- |
| Elasticity | Easy to scale your applications: simply ask for more resources for your pods, or create more pods with a replication controller |
| Scalability | A new project, but if it comes to resemble Borg it will support thousands of machines |
| Fault tolerance | Kubernetes itself is currently not an HA system, though work is going on to add support for that; it does reschedule user applications automatically when nodes fail |
| Hybrid cloud | Possible to combine resources from multiple clouds; supporting private, public and hybrid clouds is its stated motto, right on the front page |
| Multi tenancy | Not well supported yet, but looking at the roadmap this will change |
| S/A ratio | Very favorable: a few people can manage a large number of machines |
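A pod groups one or more tightly coupled containers, and a replication controller keeps a requested number of copies of a pod running. Below is a minimal pod description sketched as a plain Python dict (the same structure is normally written as YAML); the field names follow the Kubernetes v1 manifest layout, while the pod name and image are hypothetical.

```python
# Sketch of a Kubernetes pod manifest; names and image are made up.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "daas-worker", "labels": {"app": "daas"}},
    "spec": {
        "containers": [
            {
                "name": "worker",
                "image": "example/daas-worker:latest",
                # As with Mesos, you declare resources, not VMs.
                "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
            }
        ]
    },
}

# A replication controller would wrap this spec together with a
# "replicas" count and a label selector matching {"app": "daas"}.
print(pod["metadata"]["name"])
```

The label selector is what lets the replication controller find and replace failed pods automatically.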
SmartDataCenter (SDC) from Joyent is based on SmartOS. It provides a container-based service, "Triton", in public/private clusters using Zones, which are production ready. The main benefit is that it uses a technology that is battle tested and ready for production. The downside is that it is based on an OpenSolaris-derived kernel, so if you need kernel modules that are available only for Linux, you are out of luck. Running Linux userspace applications is fine, though, as SDC has LX-branded Zones to run them.
| Metric | SDC |
| --- | --- |
| Elasticity | Easy to scale applications; claims online resizing of resources |
| Scalability | Known to run on large clusters (1,000+) and to power a public cloud service |
| Fault tolerance | The SDC architecture uses a head node from which all machines in the cluster boot. As SmartOS runs entirely from RAM, the head node can be a single point of failure in case of a power outage |
| Hybrid cloud | Not yet clear from their documentation how well combining a private cloud with their public cloud is supported |
| Multi tenancy | Good support, as Zones are well isolated and provide good security for multiple tenants |
| S/A ratio | Very favorable: a few people can manage a large number of machines |
For storage we need a scale-out storage system that enables processing of data where it is stored. As data becomes "big data", moving it around is not really a solution. The principle for storing a large data set is to divide it into smaller blocks and store them on multiple servers for performance and reliability. Using this principle, you can offer an object storage service, e.g. Amazon S3, or you can add layers on top to provide block or file system access using the underlying objects, e.g. HDFS for a file system.
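The chunk-and-replicate principle behind all of these systems can be sketched in a few lines. This is a toy illustration only: the block size, server names and hash-based placement are made up for the example, and real systems (HDFS, Ceph's CRUSH) use far more sophisticated placement.

```python
import hashlib

BLOCK_SIZE = 4                # bytes here; real systems use 4-64+ MB blocks
SERVERS = ["node-a", "node-b", "node-c", "node-d"]
REPLICAS = 2                  # each block is stored on this many servers


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Divide the data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_block(block_id: str, servers=SERVERS, replicas=REPLICAS):
    """Pick `replicas` distinct servers for a block by hashing its id."""
    start = int(hashlib.sha256(block_id.encode()).hexdigest(), 16) % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]


data = b"some large dataset ..."
placement = {
    f"obj1/block{n}": place_block(f"obj1/block{n}")
    for n, _ in enumerate(split_into_blocks(data))
}
# Each block now lives on two servers: losing one node loses no data,
# and computation can be scheduled next to any replica (data locality).
```

The placement map is exactly the metadata a name node or monitor keeps, and it is what lets a scheduler send computation to where a replica already lives.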
In open source, Ceph is currently one of the main viable candidates for scale-out storage. Ceph provides object, block and file system interfaces, which satisfies the criteria above. The maturity of Ceph varies by storage type: object and block are stable (Yahoo runs an object store on it), whereas the file system is not; it may take another year or so for the file system to become stable enough to be useful. Mesos and Kubernetes are both trying to add support for a persistence layer in their resource management. This will enable running the storage components themselves under the normal resource managers, and it lowers the administration effort of managing storage.
Another option is SDC's object storage component, Manta. Manta goes beyond Amazon S3 in that you can also run computation on the data where it is stored. Manta embraces data locality to handle big data, but it does not have Hadoop bindings (embrace Hadoop, guys, like you embraced Linux with LX-branded Zones). With Hadoop bindings, the world of Hadoop would be open to Manta as well, and you do need support for distributed machine learning algorithms to go beyond basic word counting. Of course we could run HDFS on top of the regular ZFS that SDC uses, but then we would have to duplicate our big data! Another issue with SDC is the lack of a POSIX-like distributed file system. It has an NFS interface to Manta, but that NFS implementation does not support locking, so it is hard to mount it in multiple containers and run MPI applications or other tools that need a distributed file system. We could again run Ceph on ZFS, but CephFS needs a kernel module that is not yet available on SmartOS.
Providing distributed, scalable storage is not an easy challenge. It is all about trade-offs; remember the CAP theorem. There is a lot of work going on in this area, but no clear winner yet.
To build a scale-out architecture, we need to connect these containers to each other. The default model of Docker is that each container gets an IP address from a bridge, docker0. As this bridge is local to each machine, containers running on different machines can get the same IP address, which makes communication between containers hard.
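The collision is easy to see: every host hands out addresses from the same default subnet independently of the others. A small sketch (the subnet matches Docker's default 172.17.0.0/16; the allocator itself is a made-up simplification of what each docker0 bridge does):

```python
import ipaddress

# Each host runs its own docker0 bridge and allocates container IPs
# from the same default subnet, independently of every other host.
SUBNET = ipaddress.ip_network("172.17.0.0/16")


def next_ips(count):
    """Simplified per-host allocator: hand out the first usable addresses."""
    hosts = SUBNET.hosts()
    next(hosts)  # skip 172.17.0.1, taken by the bridge itself
    return [str(next(hosts)) for _ in range(count)]


host_a_containers = next_ips(2)  # containers on machine A
host_b_containers = next_ips(2)  # containers on machine B

# Both hosts assigned the very same addresses, so traffic between
# machines cannot be routed by container IP alone.
print(host_a_containers)                                   # ['172.17.0.2', '172.17.0.3']
print(set(host_a_containers) & set(host_b_containers))     # both addresses collide
```

This is exactly the problem the approaches below set out to solve.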
To solve this we need to give each container a unique IP address. There are mainly two design options for doing that:
- Overlay network: create a VLAN overlay on the physical network and give each container a unique IP address. A few projects have chosen this path: Open vSwitch, Weave and Flannel all use this method. For example, Weave makes it easy to combine containers running at different cloud providers by creating an overlay between them. The main concern is the performance hit incurred by the encapsulation. You can also use multicast in these overlay networks, e.g. Weave Multicast.
- No overlay: a project named Calico is trying to solve container networking using BGP and Linux ACLs. It does not create an overlay network but simply uses L3 routing and ACLs to isolate traffic between containers.
With support for IPv6 in Docker and other container runtimes (Rocket from CoreOS, Zones from SDC), there is going to be an IP address for each container; the question that remains is which method to use. There is not much documentation on the performance of either method, while both approaches are evolving and maturing at a fast pace. Software Defined Networking (SDN) is also one to watch, in particular how SDN can be combined with the projects above to configure the network on demand.
You have got your cluster, with containers dynamically assigned to nodes. Now how are you going to find service X? You need to know on which server service X is running and on which port, and you will need to configure your load balancer with the IP address of each new instance of service X.
Enter the world of service discovery for the modern data center. Most projects in this area support DNS for discovery, and a few also support HTTP-based key/value lookups.
- Single data center: Mesos-DNS is DNS-based service discovery to be used together with Mesos. SkyDNS is a service discovery mechanism based on etcd.
- Multi data center: Consul is a service discovery tool that works across clusters. It provides its own distributed key/value store with DNS and HTTP interfaces, and it also supports service health checks.
There may well be more projects doing service discovery, as the container area is very hot and evolving, but Consul looks like a very good candidate for realizing service discovery in a hybrid cloud infrastructure.
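At its core, the pattern all of these tools implement is a name-to-(address, port) lookup filtered by health. Here is a toy, in-memory sketch of that idea; real systems like Consul answer the same question over DNS SRV records or an HTTP API, and every name and address below is made up for illustration.

```python
import random

# Toy in-memory service registry: maps a service name to the instances
# (address, port, health) currently registered.
registry = {}


def register(service, address, port, healthy=True):
    registry.setdefault(service, []).append(
        {"address": address, "port": port, "healthy": healthy}
    )


def resolve(service):
    """Return one healthy (address, port) for a service, like a DNS SRV
    answer; raise if no healthy instance exists."""
    healthy = [i for i in registry.get(service, []) if i["healthy"]]
    if not healthy:
        raise LookupError(f"no healthy instance of {service}")
    pick = random.choice(healthy)  # trivial load balancing
    return pick["address"], pick["port"]


register("web", "10.0.0.5", 8080)
register("web", "10.0.0.6", 8080, healthy=False)  # failed its health check
addr, port = resolve("web")  # only the healthy 10.0.0.5 can be returned
```

The health filter is the part that removes the manual load balancer reconfiguration described above: a failed instance simply stops being returned.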
To provide an infrastructure that can handle the coming years' challenges, with big data from genomes, sensors and videos, we need a scale-out infrastructure with fault tolerance and automatic healing. Technologies running on such an infrastructure need to support data locality, as moving data around is not a solution.
There is no clear winner when it comes to resource managers. OpenStack is bogged down by legacy thinking (VMs), with no support for hybrid clouds and a heavy administration burden. Mesos embraces modern thinking in terms of resources and applications rather than VMs, but it started before the container movement went bananas; therefore it has another layer of abstraction, frameworks. Although frameworks use cgroups underneath for isolation, the container movement is not only about isolation. Kubernetes might get the container movement right, as it builds on lessons learned from Google's Borg (running containers in production for 5+ years). Joyent's SDC has been around containers for 8+ years and has a production-ready system when it comes to resource management and networking, but the storage layer still leaves you with the feeling that some crucial parts are missing. For us, being an NREN and supercomputing organization, a distributed file system is a must, as researchers will run a variety of workloads and will require a scale-out data storage layer to store and process data with data locality support.
Thanks to my colleagues Morten Knutsen and Olav Kvittem for interesting discussions and feedback on this article.