Thursday 31 May 2012

Integrating Cassandra data nodes in a Hadoop cluster


1. Networking setup

Make sure the machines can reach each other on the network, and update /etc/hosts on all machines. For example, in our setup we use:
# /etc/hosts on all nodes
192.168.1.96 master
192.168.1.97 slave1
192.168.1.98 slave2
192.168.1.99 slave3
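To quickly verify that every node resolves and responds (a simple sanity check; adjust the host list to match your setup):
$ for h in master slave1 slave2 slave3; do ping -c 1 $h; done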

2. Java

Install Java 6 if it is not already installed.
$ sudo add-apt-repository "deb http://archive.canonical.com/ubuntu maverick partner"
$ sudo apt-get update
$ sudo apt-get install sun-java6-jre
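You can confirm that the right JVM is on the path with:
$ java -version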

3. Hadoop installation

Add Cloudera CDH3 apt repository.
$ sudo add-apt-repository "deb http://archive.cloudera.com/debian maverick-cdh3 contrib"
$ wget -O - http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Install Hadoop.
On master:
$ sudo apt-get install hadoop-0.20-{namenode,datanode,jobtracker,tasktracker}
On slaves:
$ sudo apt-get install hadoop-0.20-{datanode,tasktracker}
 
This will create the hadoop group and the hdfs and mapred users. The namenode and datanode daemons will run as the hdfs user, while the jobtracker and tasktracker daemons will run as the mapred user.
You can find Cloudera's patches for Hadoop here: /usr/lib/hadoop-0.20/cloudera/patches
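To double-check which packages were installed and that the users were created:
$ dpkg -l 'hadoop-0.20*'
$ id hdfs && id mapred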

4. Hadoop configuration

Configure hadoop on all machines:
Edit core-site.xml
 <property>
   <name>fs.default.name</name>
   <value>hdfs://master:8020</value>
 </property>
Edit hdfs-site.xml
 <property>
   <name>dfs.replication</name>
   <value>2</value>
 </property>
Edit mapred-site.xml
 <property>
   <name>mapred.job.tracker</name>
   <value>master:8021</value>
 </property>
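On a fresh install, the namenode's storage directory must be formatted before the first start, or the namenode daemon will not come up. Assuming CDH3's default hdfs user:
$ sudo -u hdfs hadoop namenode -format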
Starting the cluster is done in 2 steps:
Step 1: Starting HDFS
The namenode daemon must be started on master:
$ sudo service hadoop-0.20-namenode start
Then datanode daemons must be started on all slaves (in our setup: master, slave1, slave2 and slave3):
$ sudo service hadoop-0.20-datanode start
Check the success or failure by inspecting the logs on master and slaves:
Check namenode log on master
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-namenode-demo.log
Check datanode logs on all slaves
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-datanode-demo.log
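Once the datanodes have registered, you can also ask the namenode for a cluster-wide report; it should list all four datanodes from our setup:
$ sudo -u hdfs hadoop dfsadmin -report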
Step 2: Starting MapReduce
The jobtracker daemon must be started on master:
$ sudo service hadoop-0.20-jobtracker start
Then tasktracker daemons must be started on all slaves (in our setup: master, slave1, slave2 and slave3):
$ sudo service hadoop-0.20-tasktracker start
Check the success or failure by inspecting the logs on master and slaves:
Check jobtracker log on master
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-jobtracker-demo.log
Check tasktracker logs on all slaves
$ sudo less /var/log/hadoop-0.20/hadoop-hadoop-tasktracker-demo.log
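A quick way to confirm MapReduce is working is to list jobs against the jobtracker (the list will simply be empty on a fresh cluster):
$ hadoop job -list
The web interfaces are also handy: by default the namenode listens on http://master:50070 and the jobtracker on http://master:50030.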

5. Install Cassandra

You must do this on all Cassandra nodes. A common practice is to install Cassandra on every Hadoop datanode, so that every Hadoop datanode is also a Cassandra node.
Import Cassandra's apt repository key:

$ gpg --keyserver pgp.mit.edu --recv-keys F758CE318D77295D
$ gpg --export --armor F758CE318D77295D | sudo apt-key add -

Add Cassandra's apt repository entries in /etc/apt/sources.list and install Cassandra.

$ sudo add-apt-repository "deb http://www.apache.org/dist/cassandra/debian unstable main"
$ sudo apt-get update
$ sudo apt-get install cassandra
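The Debian package starts Cassandra automatically after installation. You can confirm the node came up by tailing its log (the package logs to /var/log/cassandra by default):
$ sudo tail -f /var/log/cassandra/system.log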

6. Configure Cassandra

Edit cassandra.yaml on all nodes, replacing <hostname_or_ip> with the hostname or the IP of the node:
listen_address: <hostname_or_ip>
rpc_address: <hostname_or_ip>
For a multi-node cluster, the seeds setting in cassandra.yaml must also point to at least one live node, so that the nodes can discover each other.
After you have configured all the nodes, you can check the Cassandra cluster status using:
$ /usr/bin/nodetool -host <hostname_or_ip> ring
Replace <hostname_or_ip> with the hostname or IP of any node. If you configured the cluster properly, you will see a list of all the nodes in the cluster along with their assigned tokens.
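For example, in our setup, querying the master node:
$ /usr/bin/nodetool -host master ring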

7. Balance Cassandra Cluster

If you add nodes to your cluster, the ring will become unbalanced, and the only way to get a perfect balance is to compute new tokens for every node and assign them manually using the nodetool move command.
Here's a Python function which can be used to calculate evenly spaced tokens for the nodes (for the RandomPartitioner, whose token range is 0 to 2**127):
def tokens(nodes):
    # Evenly space the nodes around the 0 .. 2**127 token ring
    for x in xrange(nodes):
        print 2 ** 127 / nodes * x
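For example, for our four-node cluster, in a Python 2 interpreter:
>>> tokens(4)
0
42535295865117307932921825928971026432
85070591730234615865843651857942052864
127605887595351923798765477786913079296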
There's also nodetool loadbalance, which is essentially a convenience over decommission + bootstrap: instead of you telling the target node where to move on the ring, it chooses its location based on the same heuristic as token selection on bootstrap. You should not use it, as it does not rebalance the entire ring.
Assign each newly computed token with the move command; the status of move and balancing operations can be monitored using nodetool with the streams argument.
$ /usr/bin/nodetool -host <hostname_or_ip> move <new_token>
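For example, assuming slave1 should own the second token computed above:
$ /usr/bin/nodetool -host slave1 move 42535295865117307932921825928971026432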
