In this post we will make Yarn use Ceph rather than HDFS as its file system. We assume that you already have a Yarn cluster set up; if not, you can follow the nice guide from Cloudera (you don’t need to install the HDFS components). We also assume that you have a Ceph cluster up and running; if not, you can follow the nice how-to guide in the Ceph docs. You should install a Yarn nodemanager on each host in your cluster where Ceph OSDs are running. Once both the Yarn and Ceph clusters are up and running, follow these steps to make Yarn use Ceph.
- Install the following packages on each Ceph/nodemanager node:
apt-get install libcephfs-java libcephfs-jni
- Once they are installed on each host, copy the Ceph Java library and its native JNI library into the Hadoop lib folders:
cp /usr/share/java/libcephfs-0.72.2.jar /usr/lib/hadoop/lib/
cp /usr/lib/jni/libcephfs_jni.so* /usr/lib/hadoop/lib/native
- You will also need the cephfs-hadoop plugin jar. You can build it from this git repo, including the patch which adds support for Hadoop 2, or you can download the compiled jar from here, which works with CDH5 beta2.
- Once you have built or downloaded the cephfs-hadoop plugin jar, copy it to the Hadoop lib folder:
cp ~/cephfs-hadoop-1.0-SNAPSHOT.jar /usr/lib/hadoop/lib/
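Before moving on to the configuration, it can help to confirm that all three pieces copied above actually landed where Hadoop will look for them. The following is a minimal sketch, not part of Hadoop or Ceph: `has_match` and `check_ceph_libs` are hypothetical helper names, and the default directory is simply the `/usr/lib/hadoop/lib` path used in the steps above.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Hadoop or Ceph).

# True if at least one argument names an existing file. The caller passes
# a glob, which the shell expands before the function is called.
has_match() {
    for f in "$@"; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Report whether the Ceph jar, the cephfs-hadoop plugin jar and the JNI
# library are present under the given Hadoop lib directory.
check_ceph_libs() {
    lib_dir="${1:-/usr/lib/hadoop/lib}"
    missing=0
    has_match "$lib_dir"/libcephfs-*.jar \
        || { echo "missing: libcephfs jar"; missing=1; }
    has_match "$lib_dir"/cephfs-hadoop-*.jar \
        || { echo "missing: cephfs-hadoop plugin jar"; missing=1; }
    has_match "$lib_dir"/native/libcephfs_jni.so* \
        || { echo "missing: libcephfs_jni native library"; missing=1; }
    return "$missing"
}
```

Running `check_ceph_libs` with no argument on each node checks the default location; it prints nothing and returns 0 when all three files are found.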
- Now we have all the libraries Hadoop needs to run with Ceph. It is time to set up the configuration in the core-site.xml file so that Hadoop uses Ceph. Make sure the admin.secret file exists at the path you give in the ceph.auth.keyfile option. Add the following options to the config file:
<property>
  <name>fs.defaultFS</name>
  <value>ceph://mon-host:6789/</value>
</property>
<property>
  <name>ceph.conf.file</name>
  <value>/etc/ceph/ceph.conf</value>
</property>
<property>
  <name>ceph.auth.id</name>
  <value>admin</value>
</property>
<property>
  <name>ceph.auth.keyfile</name>
  <value>/etc/hadoop/conf/admin.secret</value>
</property>
<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephHadoop2FileSystem</value>
</property>
<property>
  <name>ceph.object.size</name>
  <value>67108864</value>
</property>
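A typo in a property name is easy to miss in a long core-site.xml, so here is a rough sketch of a sanity check for the options listed above. `check_ceph_props` is a hypothetical helper, and the default path `/etc/hadoop/conf/core-site.xml` is an assumption based on the config directory used for admin.secret in this post. It only does a plain text match, so it expects each `<name>…</name>` element to sit on one line without extra whitespace.

```shell
#!/bin/sh
# Hypothetical helper: confirm that every property listed in this post
# appears by name in a core-site.xml file. This is a text match only;
# it does not parse the XML or validate the values.
check_ceph_props() {
    conf="${1:-/etc/hadoop/conf/core-site.xml}"   # assumed default location
    rc=0
    for name in fs.defaultFS ceph.conf.file ceph.auth.id ceph.auth.keyfile \
                fs.ceph.impl fs.AbstractFileSystem.ceph.impl ceph.object.size
    do
        if grep -qF "<name>$name</name>" "$conf"; then
            echo "ok:      $name"
        else
            echo "missing: $name"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_ceph_props && [ -r /etc/hadoop/conf/admin.secret ]` returns 0 only if all the names are present and the keyfile at the path used above is readable.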
- You still need to create the Yarn and MapReduce history directories with the correct permissions, just as you would have done for HDFS. After that, you can run the following command to see the contents of the Ceph “/” path.
# hdfs dfs -ls /
Found 3 items
drw-r--r-- - yarn 209715222774 2014-04-04 15:24 /benchmarks
drwx------ - yarn 40390852613 2014-04-04 13:59 /user
drwxrwxrwt - yarn 20298878 2014-04-04 13:59 /var
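The post does not spell out the commands for the history directories, so here is one way the step might look. Treat it as a sketch under assumptions: the paths `/user/history` and `/var/log/hadoop-yarn` and the `1777` mode are common CDH defaults, not values taken from this setup, and `make_history_dirs` and the `RUN` variable are hypothetical names. Ownership is left out on purpose; set it the way your distribution's guide describes.

```shell
#!/bin/sh
# Sketch only: paths and modes are assumed CDH-style defaults, adjust them
# to match yarn-site.xml and mapred-site.xml on your cluster.
# Set RUN=echo to print the commands instead of executing them.
make_history_dirs() {
    run="${RUN:-}"
    # MapReduce job history directory, world-writable with the sticky bit
    $run hadoop fs -mkdir -p /user/history
    $run hadoop fs -chmod -R 1777 /user/history
    # Yarn log aggregation directory
    $run hadoop fs -mkdir -p /var/log/hadoop-yarn
}
```

Calling `RUN=echo make_history_dirs` first shows what would be executed. Because fs.defaultFS now points at Ceph, a plain `make_history_dirs` creates the directories there; run it as a user that is allowed to write to the root of the Ceph file system.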
- Now you have Hadoop Yarn running with Ceph. Enjoy :-)