
We will set up a 4 node Hadoop cluster using Hadoop 2.7.1 and Ubuntu 14.04.

Our cluster will consist of:

  • Ubuntu 14.04
  • Hadoop 2.7.1
  • HDFS
  • 1 Master Node
  • 3 Slave Nodes

After we have set up our Hadoop cluster, we will also run a WordCount and a streaming job.

This Big Data Series will cover:

  1. Setup Hadoop and HDFS
  2. Setup Hive
  3. Setup Pig
  4. Setup Spark
  5. Setup Presto (Coming Soon)
  6. Setup Impala (Coming Soon)
  7. Example PySpark Application
  8. Example Scala Application (Coming Soon)
  9. Example Hive Application (Coming Soon)

Let's get started with our setup:

For simplicity, please disable iptables or any other firewall-related service. The commands in the Dependencies section below must be executed on all the nodes of our cluster.
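
On Ubuntu 14.04 this typically means disabling ufw and flushing any leftover iptables rules (adjust if you run a different firewall):

$ sudo ufw disable
$ sudo iptables -F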

Installing Dependencies:

$ sudo apt-get update && sudo apt-get upgrade -y
$ sudo apt-get install openjdk-7-jdk -y 
$ cd /usr/lib/jvm
$ sudo mv java-7-openjdk-amd64 jdk
$ sudo mkdir -p /home/hadoop/mydata/hdfs/namenode
$ sudo mkdir -p /home/hadoop/mydata/hdfs/datanode
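
As a quick sanity check, the renamed JDK should respond from its new path (the exact version string will differ per patch level):

$ /usr/lib/jvm/jdk/bin/java -version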

Hadoop User and Group:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
$ sudo adduser hadoop sudo

Setup Sudoers:

$ echo "%hadoop ALL=(ALL:ALL) NOPASSWD:ALL" >> /etc/sudoers

Setup DNS in Hosts:

$ echo "
127.0.0.1 localhost
172.16.1.1 hadoop-master
172.16.1.11 hadoop-slave1
172.16.1.12 hadoop-slave2
172.16.1.13 hadoop-slave3
" >> /etc/hosts

Setup Hadoop:

$ cd ~/
$ wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
$ sudo tar vxzf hadoop-2.7.1.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.7.1 hadoop
$ sudo chown -R hadoop:hadoop hadoop
$ sudo chown -R hadoop:hadoop /home/hadoop
$ cd

Hadoop Variables:

echo "
# HADOOP
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
" >> /home/hadoop/.bashrc

Note: Now that Hadoop is installed on the master, repeat the previous steps on each of the slave nodes. Later on we will sync the configuration files we edit below over to the slaves.

Once Hadoop is installed on the slave nodes, head back to the master:

Hadoop Configuration:

From the directory $HADOOP_HOME/etc/hadoop we will configure:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
  • masters
  • slaves

core-site.xml

<configuration>
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://hadoop-master:9000</value>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
 <name>dfs.replication</name>
 <value>1</value>
 <description>Default is 3. Number of replicas kept for each HDFS block.</description>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
 <value>file:/home/hadoop/mydata/hdfs/namenode</value>
</property>
<property>
 <name>dfs.datanode.data.dir</name>
 <value>file:/home/hadoop/mydata/hdfs/datanode</value>
</property>
<property>
 <name>dfs.namenode.http-address</name>
 <value>hadoop-master:50070</value>
 <description>The primary NameNode hostname and port for HTTP access.</description>
</property>
<property>
 <name>dfs.namenode.secondary.http-address</name>
 <value>hadoop-slave1:50090</value>
 <description>Optional. Only needed if you want the Secondary NameNode on a specific node.</description>
</property>
</configuration>

mapred-site.xml

<configuration>
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
</property>
</configuration>
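
Note: the 2.7.1 tarball ships only a template for this file; if mapred-site.xml does not exist yet, create it from the template before adding the property above:

$ cd /usr/local/hadoop/etc/hadoop
$ cp mapred-site.xml.template mapred-site.xml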

yarn-site.xml

<configuration>
<property>
 <name>yarn.resourcemanager.hostname</name>
 <value>hadoop-master</value>
 <description>The hostname the NodeManagers use to find the ResourceManager.</description>
</property>
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
 <description>Shuffle service required for MapReduce on YARN.</description>
</property>
</configuration>

masters

hadoop-master

slaves

hadoop-slave1
hadoop-slave2
hadoop-slave3
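
To confirm Hadoop actually picks up the edited configuration, a key can be queried back out of it:

$ hdfs getconf -confKey fs.defaultFS
hdfs://hadoop-master:9000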

Setup passwordless SSH between all nodes:

This will be executed from the master node; do the same on all the other nodes as well.

$ ssh-keygen -t rsa
$ for x in hadoop-master hadoop-slave1 hadoop-slave2 hadoop-slave3; do ssh-copy-id $x; done
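
If the key exchange worked, each host should print its hostname without prompting for a password:

$ for x in hadoop-master hadoop-slave1 hadoop-slave2 hadoop-slave3; do ssh $x hostname; done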

Now that we can SSH to all the nodes without a password prompt, let's prepare the directories and sync our data over to the nodes:

On all the slave nodes, prepare the directories:

$ mkdir -p /home/hadoop/mydata/hdfs/datanode
$ sudo mkdir -p /usr/local/hadoop
$ sudo chown -R hadoop:hadoop /home/hadoop/mydata/hdfs
$ sudo chown -R hadoop:hadoop /usr/local/hadoop
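
Since passwordless SSH and passwordless sudo are already in place, the same preparation can optionally be pushed from the master in a single loop instead of logging into each slave:

$ for x in hadoop-slave1 hadoop-slave2 hadoop-slave3; do ssh $x "mkdir -p /home/hadoop/mydata/hdfs/datanode && sudo mkdir -p /usr/local/hadoop && sudo chown -R hadoop:hadoop /home/hadoop/mydata /usr/local/hadoop"; done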

Sync the data from our master node to the slave nodes:

$ for x in hadoop-slave1 hadoop-slave2 hadoop-slave3; do rsync -avxP /usr/local/hadoop/ $x:/usr/local/hadoop/ ; done

Format the NameNode:

$ hdfs namenode -format
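
If the format succeeded, the NameNode directory we configured earlier should now contain a fresh fsimage and a VERSION file:

$ ls /home/hadoop/mydata/hdfs/namenode/current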

Start Hadoop:

Start HDFS first with start-dfs.sh; once that is up, start YARN with start-yarn.sh and then the MapReduce JobHistory server with mr-jobhistory-daemon.sh:

$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver
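
For reference, the daemons are stopped in the reverse order:

$ mr-jobhistory-daemon.sh stop historyserver
$ stop-yarn.sh
$ stop-dfs.sh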

Everything should now be up and running. We can verify this by running jps; on the master node we should see:

$ jps
2475 Jps
1174 NameNode
2435 JobHistoryServer
1421 ResourceManager

and from the slave nodes, we should see (the SecondaryNameNode will only appear on hadoop-slave1, where we configured it to run):

$ jps
1051 DataNode
1190 NodeManager
2967 Jps
1110 SecondaryNameNode
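
The web UIs should also be reachable on their default ports (assuming no ports were changed beyond the ones configured above):

NameNode:          http://hadoop-master:50070
ResourceManager:   http://hadoop-master:8088
JobHistoryServer:  http://hadoop-master:19888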

Command Line Examples:

HDFS: DFSAdmin

Using the HDFS dfsadmin client, we can view basic filesystem information and statistics:

$ hdfs dfsadmin -report

Configured Capacity: 42017480704 (39.13 GB)
Present Capacity: 36992126976 (34.45 GB)
DFS Remaining: 36987867136 (34.45 GB)
DFS Used: 4259840 (4.06 MB)
DFS Used%: 0.01%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (3):

Name: 172.16.1.11:50010 (hadoop-slave1)
Hostname: hadoop-slave1.ovz.lan.bekkers.co.za
Decommission Status : Normal
Configured Capacity: 21008740352 (19.57 GB)
DFS Used: 667648 (652 KB)
Non DFS Used: 2406580224 (2.24 GB)
DFS Remaining: 18601492480 (17.32 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.54%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Apr 18 18:08:24 EDT 2016
...
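
Another way to check the health of HDFS is to run a filesystem check against the root path, which reports total blocks, replication status and any corrupt or missing blocks:

$ hdfs fsck /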

HDFS Safemode:

Safe mode is entered automatically at NameNode startup and is left automatically once the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it must be turned off manually as well:

Entering Safemode:

$ hdfs dfsadmin -safemode enter
Safe mode is ON

Leaving Safemode:

$ hdfs dfsadmin -safemode leave
Safe mode is OFF
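
The current state can be queried at any time:

$ hdfs dfsadmin -safemode get
Safe mode is OFF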

YARN

Listing Connected Nodes:

$ yarn node -list
16/04/18 18:19:24 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.1.1:8032
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
hadoop-slave2.ovz.lan.bekkers.co.za:49326               RUNNING hadoop-slave2.ovz.lan.bekkers.co.za:8042                                   0
hadoop-slave1.ovz.lan.bekkers.co.za:49025               RUNNING hadoop-slave1.ovz.lan.bekkers.co.za:8042
...

Listing Running Applications:

$ yarn application -list
16/04/18 18:22:48 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.1.1:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
application_1461011206069_0006            word count               MAPREDUCE        hadoop         default                 RUNNING               UNDEFINED                   5%
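
A single application from the list can also be inspected by its ID (using the same ID from the listing above):

$ yarn application -status application_1461011206069_0006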

Killing an Application:

$ yarn application -kill application_1461011206069_0006

Running a WordCount Application:

$ cd ~/
$ wget -O words.txt https://dl.dropboxusercontent.com/u/31991539/repo/sample-data/hadoop-words.txt

$ hdfs dfs -mkdir -p /user/hadoop/test/in
$ hdfs dfs -copyFromLocal words.txt /user/hadoop/test/in
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount test/in test/out
$ hdfs dfs -cat /user/hadoop/test/out/part-r-00000
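
The job also writes a _SUCCESS marker alongside the reducer output, which can be confirmed by listing the output directory:

$ hdfs dfs -ls /user/hadoop/test/out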

Running a Hadoop Streaming Job:

$ hdfs dfs -mkdir -p /user/hadoop/test2/in
$ hdfs dfs -copyFromLocal words.txt /user/hadoop/test2/in
$ hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input test2/in -output test2/out -mapper /bin/cat -reducer /usr/bin/wc
$ hdfs dfs -cat /user/hadoop/test2/out/part-00000
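
MapReduce will refuse to write to an output directory that already exists, so to rerun either job the old output has to be removed first:

$ hdfs dfs -rm -r /user/hadoop/test/out /user/hadoop/test2/out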

Next, we will set up Hive.

Related: Bash Script: Setup a 3 Node Hadoop Cluster on Containers