Setting up your own Hadoop cluster with Cloudera distro for EC2
Installing Prerequisites
Before you begin, find the codename of your distro release (if you don’t already know it) by running this command from the shell:
lsb_release -c
For my setup, I used Ubuntu 8.04.3 LTS (Hardy Heron), for which the command reports the codename hardy.
Add the Cloudera repositories for your distro to your apt sources. Create a file called /etc/apt/sources.list.d/cloudera.list and add the following two lines to it, replacing [distro] with your release codename:
deb http://archive.cloudera.com/debian [distro]-stable contrib
deb-src http://archive.cloudera.com/debian [distro]-stable contrib
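The two source lines above can also be generated rather than typed by hand. This is a minimal sketch, assuming lsb_release is available; the function name cloudera_sources is hypothetical, and it only prints the lines, so nothing is written until you pipe the output into the file yourself:

```shell
#!/bin/sh
# Hypothetical helper: print the two Cloudera apt source lines for a codename.
# With no argument, the codename is taken from `lsb_release -cs`.
cloudera_sources() {
    codename="${1:-$(lsb_release -cs)}"
    echo "deb http://archive.cloudera.com/debian ${codename}-stable contrib"
    echo "deb-src http://archive.cloudera.com/debian ${codename}-stable contrib"
}

# Usage (writes the file; requires root):
# cloudera_sources | sudo tee /etc/apt/sources.list.d/cloudera.list
```

Passing the codename explicitly, as in cloudera_sources hardy, lets you preview the lines for a release other than the one you are running.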
Add the repository’s GPG key to your keyring:
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Update apt:
sudo apt-get update
Install Hadoop:
sudo apt-get install hadoop
Install boto and simplejson for Python (I used pip but you could use easy_install as well):
sudo pip install boto simplejson
Setting Up Cloudera Client Scripts
Download the Cloudera client scripts:
wget http://cloudera-packages.s3.amazonaws.com/cloudera-for-hadoop-on-ec2-py-0.3.0-beta.tar.gz
The location of the scripts may change. The only link to them can be found on this page:
http://archive.cloudera.com/docs/_getting_started.html
Once downloaded, extract the archive’s contents to /opt/cloudera. You could put it anywhere else you like.
Modify your environment variables and add the following to your .bash_profile:
export AWS_ACCESS_KEY_ID=[your AWS access key ID]
export AWS_SECRET_ACCESS_KEY=[your AWS secret access key]
export PATH=$PATH:/opt/cloudera
The Cloudera client scripts read your AWS credentials from the environment, so both variables must be set for them to work properly. Adding the directory where the scripts reside to your PATH lets you call the hadoop-ec2 command that comes bundled with the package.
Source your .bash_profile to update your environment with the changes:
source ~/.bash_profile
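Before going further, it can help to confirm that both credential variables actually made it into your environment. This is a minimal sketch, not part of the Cloudera scripts; the function name check_aws_env is hypothetical, and it deliberately never prints the values themselves:

```shell
#!/bin/sh
# Hypothetical helper: report which AWS credential variables are missing.
# Returns 0 when both are set and non-empty, 1 otherwise.
check_aws_env() {
    missing=0
    for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # Indirect lookup via eval keeps this compatible with POSIX sh.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "$var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_aws_env && echo "AWS credentials found in the environment"
```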
Create a .hadoop-ec2 config directory in your home directory:
mkdir ~/.hadoop-ec2
Create a file called ec2-clusters.cfg in your .hadoop-ec2 directory with the following contents:
[my-hadoop-cluster]
ami=ami-6159bf08
instance_type=c1.medium
key_name=[your EC2 key name]
availability_zone=us-east-1b
private_key=/path/to/keypair.pem
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
You can define multiple blocks in this file, one for each of your clusters. We just created one block for a cluster named my-hadoop-cluster.
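If you do end up with several clusters, the blocks can be appended with a small shell function instead of by hand. This is a sketch under stated assumptions: the name add_cluster_block and its argument order are hypothetical, and the AMI, instance type, and availability zone are simply the values from the block above, so adjust them if your clusters differ:

```shell
#!/bin/sh
# Hypothetical helper: append one cluster block to an ec2-clusters.cfg file.
# Arguments: config file, cluster name, EC2 key name, path to the private key.
# The AMI, instance type, and availability zone are the values used above.
add_cluster_block() {
    cfg="$1"; name="$2"; key_name="$3"; private_key="$4"
    cat >> "$cfg" <<EOF
[$name]
ami=ami-6159bf08
instance_type=c1.medium
key_name=$key_name
availability_zone=us-east-1b
private_key=$private_key
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
EOF
}

# Usage (cluster, key, and path names here are placeholders):
# add_cluster_block ~/.hadoop-ec2/ec2-clusters.cfg my-second-cluster my-key /path/to/keypair.pem
```

The function appends rather than overwrites, so calling it twice with different names leaves two blocks in the file.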
Test that the Cloudera client scripts are working:
hadoop-ec2 list
If everything is configured properly up to this point, you should get output that says:
No running clusters
Launch your cluster (in this case with a master and one slave node):
hadoop-ec2 launch-cluster my-hadoop-cluster 1
Set up a proxy to the cluster:
hadoop-ec2 proxy my-hadoop-cluster
Setting Up Pig
The version of Pig in the Cloudera apt repositories was a little out of date for me, so I decided to install a more recent version manually. The version of Hadoop that runs on the cluster is 0.18.3, and the latest version of Pig that is compatible with that Hadoop version is 0.4.0. You can download it from here:
wget http://apache.cs.utah.edu/hadoop/pig/pig-0.4.0/pig-0.4.0.tar.gz
If the link above doesn’t work, then try any of the mirrors for Pig listed on this page:
http://www.apache.org/dyn/closer.cgi/hadoop/pig
Extract the contents of the archive to /opt/pig. Again, you could put it anywhere you desire.
Add the following to your .bash_profile:
export HADOOP_CONF_DIR=~/.hadoop-ec2/my-hadoop-cluster
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PIG_CLASSPATH=/opt/pig/pig-0.4.0-core.jar:$HADOOP_CONF_DIR
export PIG_HADOOP_VERSION=0.18.3
The HADOOP_CONF_DIR environment variable should point to the location of your cluster’s hadoop-site.xml file. If you are going to be running multiple clusters, you may want to create some shell scripts that set the appropriate environment variables for each cluster before running any commands like Pig, which depend on those values.
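One way to handle multiple clusters is a shell function that switches the cluster-specific variables in one step. This is a minimal sketch: the name use_cluster is hypothetical, and it assumes each cluster’s hadoop-site.xml lives in ~/.hadoop-ec2/[cluster name] and that Pig is installed in /opt/pig as described above:

```shell
#!/bin/sh
# Hypothetical helper: point the Hadoop and Pig environment at one cluster.
# Assumes each cluster's hadoop-site.xml lives in ~/.hadoop-ec2/<cluster name>
# and that Pig 0.4.0 is installed in /opt/pig.
use_cluster() {
    if [ -z "${1:-}" ]; then
        echo "usage: use_cluster <cluster-name>" >&2
        return 1
    fi
    HADOOP_CONF_DIR="$HOME/.hadoop-ec2/$1"
    PIG_CLASSPATH="/opt/pig/pig-0.4.0-core.jar:$HADOOP_CONF_DIR"
    export HADOOP_CONF_DIR PIG_CLASSPATH
    # Warn, but do not fail, if the cluster's config file is not there yet.
    if [ ! -f "$HADOOP_CONF_DIR/hadoop-site.xml" ]; then
        echo "warning: no hadoop-site.xml in $HADOOP_CONF_DIR" >&2
    fi
}

# Usage, after sourcing the file that defines the function:
# use_cluster my-hadoop-cluster && pig
```

Only HADOOP_CONF_DIR and PIG_CLASSPATH change per cluster here; HADOOP_HOME, JAVA_HOME, and PIG_HADOOP_VERSION can stay in your .bash_profile.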
Don’t forget to add the Pig bin directory to your PATH as well:
export PATH=$PATH:/opt/cloudera:/opt/pig/bin
Source your .bash_profile to update your environment with the changes and then run the pig command:
pig
It should launch Pig in mapreduce mode by default. In this mode, the Grunt shell connects to your EC2 Hadoop cluster rather than to your local Hadoop installation. You should see something like the following:
2009-11-07 15:02:15,930 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8020/
2009-11-07 15:02:16,653 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: ec2-xx-xxx-xx-xx.compute-1.amazonaws.com:8021
Congratulations, you now have a fully functioning Hadoop cluster running on EC2 with Pig ready to do your bidding.