Installing Hadoop

= Hadoop =

Java
in /etc/apt/sources.list add:

deb http://ppa.launchpad.net/ferramroberto/java/ubuntu lucid main
deb-src http://ppa.launchpad.net/ferramroberto/java/ubuntu lucid main

Run:

apt-get install python-software-properties
add-apt-repository ppa:ferramroberto/java
apt-get update
apt-get install sun-java6-jdk
java -version

(If you ever need to undo the PPA, run add-apt-repository --remove ppa:ferramroberto/java.)

Hadoop user
Make the hadoop user's home folder at /hadoop.

Make the hadoop user with useradd -d /hadoop -s /bin/bash hadoop. Make sure that the /hadoop folder is now owned by hadoop:hadoop (chown hadoop:hadoop /hadoop if not).

Restrict the hadoop home directory with chmod 700 /hadoop.

Passwordless SSH
Make a key with

sudo -u hadoop ssh-keygen -f /hadoop/.ssh/id_rsa -t rsa -P ""

Create the file:

sudo -u hadoop touch /hadoop/.ssh/authorized_keys

Accept the key:

cat /hadoop/.ssh/id_rsa.pub >> /hadoop/.ssh/authorized_keys

Test this by running ssh localhost as the hadoop user.
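The key setup above can be sketched end to end. The sketch below uses a temporary directory as a stand-in for /hadoop (the real commands run as the hadoop user), and adds the permission tightening that sshd's StrictModes checks normally require before it will accept a key:

```shell
# Stand-in for /hadoop; on the real machine these run as the hadoop user.
H=$(mktemp -d)
mkdir -p "$H/.ssh"

# Generate a passwordless RSA key pair.
ssh-keygen -q -t rsa -P "" -f "$H/.ssh/id_rsa"

# Authorize the public key for login.
cat "$H/.ssh/id_rsa.pub" >> "$H/.ssh/authorized_keys"

# sshd refuses keys if these are group/world accessible.
chmod 700 "$H/.ssh"
chmod 600 "$H/.ssh/authorized_keys"
```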

Set up hadoop:
To disable ipv6, in /etc/sysctl.conf add:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Apply the settings with sysctl -p (or reboot). Then add to /etc/apt/sources.list:

deb http://archive.cloudera.com/debian lucid-cdh2 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh2 contrib

Run:

wget -q -O - http://archive.cloudera.com/debian/archive.key | apt-key add -
apt-get update

apt-get install hadoop hadoop-pipes hadoop-0.20-conf-pseudo

Run mkdir -p /app/hadoop/tmp

Change its owner and restrict it: chown hadoop:hadoop /app/hadoop/tmp and chmod 750 /app/hadoop/tmp.

Set up conf files
Find the following files in /etc/hadoop-0.20/conf.pseudo:


 * hdfs-site.xml
 * core-site.xml
 * mapred-site.xml

and move them to /usr/lib/hadoop/conf.

Change the owner of /usr/lib/hadoop to the hadoop user with chown -R hadoop:hadoop /usr/lib/hadoop. Set the java home in /usr/lib/hadoop/conf/hadoop-env.sh to /usr/lib/jvm/java-6-sun/.


 * In /usr/lib/hadoop/conf/hdfs-site.xml: set dfs.replication to 3 and remove the dfs.permissions property.
 * In core-site.xml: set fs.default.name to hdfs://CLUSTERNAME:8020, where CLUSTERNAME is the head node's hostname.
 * In mapred-site.xml: set mapred.job.tracker to CLUSTERNAME:8021.
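As a sketch, the resulting entries look like this (each <property> goes inside the <configuration> element of its file; CLUSTERNAME stands for your head node's hostname, and the JobTracker property's full name is mapred.job.tracker):

```xml
<!-- /usr/lib/hadoop/conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://CLUSTERNAME:8020</value>
</property>

<!-- /usr/lib/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

<!-- /usr/lib/hadoop/conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>CLUSTERNAME:8021</value>
</property>
```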


At this point, hadoop is set up on the head node.

Golden Node
Install Java the same way.

Copy the hadoop user's ~/.ssh folder from the head node to the golden node. This is tricky because of its 700 permissions: stage the copy through the /tmp/ folder, then restore the ownership and permissions on the golden node.

Mounting the data directories
Run `fdisk -l` as root to see which partition (/dev/sda<n> or likewise) the /data directory is on, which I will refer to as PARTITION. Then add the following to /etc/fstab:

PARTITION      /data           ext4    defaults        0       0

Run mount -a.

mkdir -p the following directories:


 * /data/hadoop/tmp/hadoop/dfs/data
 * /data/hadoop/tmp/hadoop/mapred/local
 * /data/hadoop/dfs/name

To change their owner to hadoop, run chown -R hadoop:hadoop /data/hadoop.
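The layout above can be sketched as follows, using a scratch directory in place of /data (the real commands run as root):

```shell
# Scratch root standing in for /data.
ROOT=$(mktemp -d)

# DataNode block storage, MapReduce local scratch, and NameNode metadata.
for d in hadoop/tmp/hadoop/dfs/data \
         hadoop/tmp/hadoop/mapred/local \
         hadoop/dfs/name; do
  mkdir -p "$ROOT/$d"
done

# On the real cluster, finish with: chown -R hadoop:hadoop /data/hadoop
ls -R "$ROOT/hadoop"
```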

Hadoop Configuration
All of the following files are in /usr/lib/hadoop/conf/.


 * In hdfs-site.xml, set dfs.name.dir to "/data/hadoop/dfs/name" (the last /data directory created).
 * In core-site.xml, set hadoop.tmp.dir to "/data/hadoop/tmp/${user.name}".
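As a sketch, the corresponding entries (each inside the <configuration> element of its file):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>

<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp/${user.name}</value>
</property>
```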

On the head node:
In /usr/lib/hadoop/conf, edit files masters and slaves (create if they don't exist):


 * In masters: add only the name of the head node
 * In slaves: add the name of each slave node set up with hadoop, one per line
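For example, with a head node and two workers (hostnames below are hypothetical; the "# conf/..." lines are labels, not file content), the files would contain:

```
# conf/masters
head01

# conf/slaves
node01
node02
```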

At this point, you should clone the golden node to the other cluster nodes.

On the head node, format the namenode with sudo -u hadoop /usr/lib/hadoop/bin/hadoop namenode -format. Then start the cluster with sudo -u hadoop /usr/lib/hadoop/bin/start-all.sh.

To confirm this worked, run sudo jps on both the head node and the golden node. On the head node you should see the namenode, tasktracker, datanode, jobtracker, and secondarynamenode processes. On the slave node you should see datanode and tasktracker.