Service Infrastructure

Since all modern ICT services can be accessed via the Internet and/or require Internet access, UNINETT continuously monitors areas of technology and standards that may be relevant and important for online services. Exploiting and provisioning network and system resources for services is one of the core focus areas of the service infrastructure work.

Hadoop YARN (2.2.0) setup with Ceph

In this post we will make YARN use Ceph rather than HDFS as its file system. We assume that you already have a YARN cluster set up; if not, you can follow the nice guide from Cloudera (you don't need to install the HDFS components). We also assume that you have a Ceph cluster up and running; if not, you can follow the nice how-to guide in the Ceph docs. You should install the YARN NodeManagers on each host in your cluster where Ceph OSDs are running. Once both the YARN and Ceph clusters are up and running, follow these steps to make YARN start using Ceph.

  • Install the following packages on each Ceph/NodeManager node:
  • apt-get install libcephfs-java libcephfs-jni
  • Once installed on each host, copy the Ceph Java libraries into the Hadoop lib folders:
  • cp /usr/share/java/libcephfs-0.72.2.jar /usr/lib/hadoop/lib/
    cp /usr/lib/jni/libcephfs_jni.so* /usr/lib/hadoop/lib/native
  • You will also need the cephfs-hadoop plugin jar, built by compiling this git repo together with the patch that adds support for Hadoop 2. Alternatively, you can download the compiled jar from here, which works with CDH5 beta2.
  • Once you have built or downloaded the cephfs-hadoop plugin jar, copy it to the Hadoop lib folder:
  • cp ~/cephfs-hadoop-1.0-SNAPSHOT.jar /usr/lib/hadoop/lib/
  • We now have all the libraries Hadoop needs to run with Ceph. The next step is to set up the configuration in core-site.xml so that Hadoop uses Ceph. Make sure the admin.secret keyfile exists at the path you give in the config options, then add the following properties (the ceph.object.size value of 67108864 gives 64 MB objects):
  • <property>
      <name>fs.defaultFS</name>
      <value>ceph://mon-host:6789/</value>
    </property>
    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>
    <property>
      <name>ceph.auth.id</name>
      <value>admin</value>
    </property>
    <property>
      <name>ceph.auth.keyfile</name>
      <value>/etc/hadoop/conf/admin.secret</value>
    </property>
    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephHadoop2FileSystem</value>
    </property>
    <property>
      <name>ceph.object.size</name>
      <value>67108864</value>
    </property>
  • You do need to create the YARN and MapReduce history directories with the correct permissions, just as you would have done for HDFS (see the sketch after this list). Now you can run the following command and see the contents of the Ceph "/" path:
  • # hdfs dfs -ls /
    Found 3 items
    drw-r--r-- - yarn 209715222774 2014-04-04 15:24 /benchmarks
    drwx------ - yarn 40390852613 2014-04-04 13:59 /user
    drwxrwxrwt - yarn 20298878 2014-04-04 13:59 /var
  • Now you have Hadoop YARN running with Ceph; a quick smoke test is sketched below. Enjoy :-)
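For the history directories mentioned above, the usual hdfs dfs commands work unchanged, since they go through the Hadoop FileSystem API and now land on Ceph. The exact paths depend on your yarn-site.xml and mapred-site.xml; the sketch below assumes the CDH5 defaults (/user/history for the JobHistory Server, /tmp/logs for log aggregation), so adjust to your own settings.

    # world-writable /tmp, as you would have on HDFS
    hdfs dfs -mkdir -p /tmp
    hdfs dfs -chmod 1777 /tmp
    # MapReduce JobHistory directory (mapreduce.jobhistory.done-dir lives under here)
    hdfs dfs -mkdir -p /user/history
    hdfs dfs -chmod -R 1777 /user/history
    hdfs dfs -chown mapred:hadoop /user/history
    # YARN log aggregation directory (yarn.nodemanager.remote-app-log-dir)
    hdfs dfs -mkdir -p /tmp/logs
    hdfs dfs -chmod -R 1777 /tmp/logs
    hdfs dfs -chown yarn:hadoop /tmp/logs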
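As a final smoke test, you can run one of the example MapReduce jobs that ship with Hadoop; if everything is wired up correctly, the job's staging and output land on Ceph. The jar path below is where the CDH packages install the examples, and /benchmarks/teragen-test is just an illustrative output path; adjust both to your layout.

    # estimate pi with 4 mappers; all I/O goes through the ceph:// filesystem
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 4 1000
    # generate some rows and confirm the output is visible on Ceph
    yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000000 /benchmarks/teragen-test
    hdfs dfs -ls /benchmarks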

Data Analysis as a Service: Project motivation and abstract

In recent years, the amount of data generated has been increasing exponentially. The data comes from different sources such as machine logs, gene sequencing, sensor networks, network flows and social media. Researchers in the education and research sector, in areas such as bioinformatics, computer science, astronomy and environmental science, have huge data sets and would like to analyze them without worrying about their scale. Thus there is an increasing demand for putting this data to work by storing and processing it in a horizontally scalable way. In recent years there has also been a rise of commercially backed, distributed open-source software from the main global actors, e.g. Google, Yahoo, Twitter and Facebook. This distributed software uses commodity hardware to store big data and provides the ability to process it locally, thus offering good economies of scale.

The Data Analysis as a Service (DaaS) project will investigate the possibility of providing a common infrastructure where researchers can store and process their data using advanced algorithms at large scale. In this way, we can contribute to building an ecosystem where researchers can analyze their big data sets and share not just the data but also the whole processing pipelines. This will provide a great opportunity to collaborate with researchers across different institutions and nations. Moreover, researchers can help each other evolve the ecosystem by adding new functionality, thus improving both research and the DaaS platform.