Nov 7 / Nizam Sayeed

Setting up your own Hadoop cluster with Cloudera distro for EC2

Installing Prerequisites

Before you begin, figure out what distro you are on (if you don’t already know) by issuing this command from the shell:

lsb_release -c

For my setup, I used Ubuntu 8.04.3 LTS Hardy Heron.
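On Hardy, for example, the command reports the codename you'll need in the next step:

Codename: hardy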

Add the Cloudera repositories for your distro to your apt sources. Create a file called /etc/apt/sources.list.d/cloudera.list and add the following two lines to it, replacing [distro] with the codename reported by lsb_release -c:

deb http://archive.cloudera.com/debian [distro]-stable contrib
deb-src http://archive.cloudera.com/debian [distro]-stable contrib
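For Hardy, the finished file would read:

deb http://archive.cloudera.com/debian hardy-stable contrib
deb-src http://archive.cloudera.com/debian hardy-stable contrib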

Add the repository's GPG key to your keyring:

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

Update apt:

sudo apt-get update

Install Hadoop:

sudo apt-get install hadoop
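To sanity-check the install (the package should put the hadoop command on your PATH), print the version:

hadoop version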

Install boto and simplejson for Python (I used pip, but you could use easy_install as well):

sudo pip install boto simplejson
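A quick way to confirm both modules are importable (this assumes pip installed into the same interpreter that python on your PATH points to):

python -c 'import boto, simplejson; print "ok"'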

Setting Up Cloudera Client Scripts

Download the Cloudera client scripts:

wget http://cloudera-packages.s3.amazonaws.com/cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz

The location of the scripts may change; the current link can always be found on this page:

http://archive.cloudera.com/docs/_getting_started.html

Once downloaded, extract the archive's contents to /opt/cloudera. You could put them anywhere else you like.
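For example, assuming the tarball unpacks into a single top-level directory (adjust --strip-components if yours differs):

# create the target directory and flatten the archive into it
sudo mkdir -p /opt/cloudera
sudo tar -xzf cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz -C /opt/cloudera --strip-components=1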

Modify your environment variables and add the following to your .bash_profile:

export AWS_ACCESS_KEY_ID=[your AWS access key ID]
export AWS_SECRET_ACCESS_KEY=[your AWS secret access key]
export PATH=$PATH:/opt/cloudera

The Cloudera client scripts need access to your AWS credentials from the system environment in order to work properly. Adding the directory where the scripts reside to your PATH allows you to call the hadoop-ec2 command that comes bundled with the package.

Source your .bash_profile to update your environment with the changes.

source ~/.bash_profile

Create your .hadoop-ec2 config directory in your home directory:

mkdir ~/.hadoop-ec2

Create a file called ec2-clusters.cfg in your .hadoop-ec2 directory with the following contents:

[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=c1.medium
key_name=[your EC2 key name]
availability_zone=us-east-1b
private_key=/path/to/keypair.pem
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no

You could have multiple blocks defined in this file, one for each of your clusters. We just created one block for a cluster named my-hadoop-cluster.
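For example, a second block could describe a beefier cluster alongside the first (the section name and instance type here are just illustrative):

[my-big-cluster]
ami=ami-6159bf08
instance_type=c1.xlarge
key_name=[your EC2 key name]
availability_zone=us-east-1b
private_key=/path/to/keypair.pem
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no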

Test that the Cloudera client scripts are working:

hadoop-ec2 list

If everything is configured properly up to this point, you should get output that says:

No running clusters

Launch your cluster (in this case with a master and one slave node):

hadoop-ec2 launch-cluster my-hadoop-cluster 1

Set up a proxy to the cluster:

hadoop-ec2 proxy my-hadoop-cluster

Setting Up Pig

The version of Pig in the Cloudera apt repositories was a little out of date for me, so I decided to install a newer version manually. The version of Hadoop that runs on the cluster is 0.18.3, and the latest version of Pig compatible with that Hadoop version is 0.4.0. You can download it from here:

wget http://apache.cs.utah.edu/hadoop/pig/pig-0.4.0/pig-0.4.0.tar.gz

If the link above doesn't work, try any of the Pig mirrors listed on this page:

http://www.apache.org/dyn/closer.cgi/hadoop/pig

Extract the contents of the archive to /opt/pig. Again, you could put them anywhere you desire.
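As before, a sketch assuming the tarball unpacks into a single top-level pig-0.4.0/ directory; flattening it like this matches the /opt/pig/pig-0.4.0-core.jar path used below:

sudo mkdir -p /opt/pig
sudo tar -xzf pig-0.4.0.tar.gz -C /opt/pig --strip-components=1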

Add the following to your .bash_profile:

export HADOOP_CONF_DIR=~/.hadoop-ec2/my-hadoop-cluster
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PIG_CLASSPATH=/opt/pig/pig-0.4.0-core.jar:$HADOOP_CONF_DIR
export PIG_HADOOP_VERSION=0.18.3

The HADOOP_CONF_DIR environment variable should point to the location of your cluster’s hadoop-site.xml file. If you are going to be running multiple clusters, you may want to create some shell scripts that set the appropriate environment variables for each cluster before running any commands like Pig, which depend on those values.
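A minimal sketch of such a helper, with use-cluster.sh as a hypothetical name; source it (rather than executing it) so the exports take effect in your current shell:

# use-cluster.sh -- point Hadoop and Pig at a given cluster's config
# usage: source use-cluster.sh my-hadoop-cluster
export HADOOP_CONF_DIR=~/.hadoop-ec2/"$1"
export PIG_CLASSPATH=/opt/pig/pig-0.4.0-core.jar:$HADOOP_CONF_DIR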

Don't forget to add the Pig bin directory to your PATH as well:

export PATH=$PATH:/opt/cloudera:/opt/pig/bin

Source your .bash_profile to update your environment with the changes and then run the pig command:

pig

It should launch Pig in mapreduce mode by default. In this mode, the grunt shell will be connected to your EC2 Hadoop cluster rather than your local Hadoop installation. You should see something like the following:

2009-11-07 15:02:15,930 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/
2009-11-07 15:02:16,653 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8021

Congratulations, you now have a fully functioning Hadoop cluster running on EC2 with Pig ready to do your bidding.
