We will set up a 4-node Hadoop cluster using Hadoop 2.7.1 and Ubuntu 14.04.
Our cluster will consist of:
- Ubuntu 14.04
- Hadoop 2.7.1
- HDFS
- 1 Master Node
- 3 Slave Nodes
After we have set up our Hadoop cluster, we will also run a WordCount and a streaming job.
This Big Data Series will cover:
- Setup Hadoop and HDFS
- Setup Hive
- Setup Pig
- Setup Spark
- Setup Presto (Coming Soon)
- Setup Impala (Coming Soon)
- Example PySpark Application
- Example Scala Application (Coming Soon)
- Example Hive Application (Coming Soon)
Let's get started with our setup:
For simplicity, please disable iptables or any other firewall-related service. The commands in the Dependencies section must be executed on all the nodes of our cluster.
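On Ubuntu 14.04 the stock firewall front-end is ufw; assuming you are not running another firewall manager, you can disable it on every node with:
$ sudo ufw disable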
Installing Dependencies:
$ sudo apt-get update && sudo apt-get upgrade -y
$ sudo apt-get install openjdk-7-jdk -y
$ cd /usr/lib/jvm
$ sudo mv java-7-openjdk-amd64 jdk
$ sudo mkdir -p /home/hadoop/mydata/hdfs/namenode
$ sudo mkdir -p /home/hadoop/mydata/hdfs/datanode
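To confirm the JDK is usable from its renamed path, check the version; it should report an OpenJDK 1.7 build:
$ /usr/lib/jvm/jdk/bin/java -version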
Hadoop User and Group:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
$ sudo adduser hadoop sudo
Setup Sudoers:
$ echo "%hadoop ALL=(ALL:ALL) NOPASSWD:ALL" >> /etc/sudoers
Setup DNS in Hosts:
$ echo "
127.0.0.1 localhost
172.16.1.1 hadoop-master
172.16.1.11 hadoop-slave1
172.16.1.12 hadoop-slave2
172.16.1.13 hadoop-slave3
" >> /etc/hosts
Setup Hadoop:
$ cd ~/
$ wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
$ sudo tar vxzf hadoop-2.7.1.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv hadoop-2.7.1 hadoop
$ sudo chown -R hadoop:hadoop hadoop
$ sudo chown -R hadoop:hadoop /home/hadoop/mydata/hdfs
$ cd
Hadoop Variables:
echo "
# HADOOP
export JAVA_HOME=/usr/lib/jvm/jdk/
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
" >> /home/hadoop/.bashrc
Note: Now that Hadoop is installed on the master, repeat the previous steps on your slave nodes. Later on we will sync the configuration files that we edit below over to the slaves.
Once Hadoop is installed on the slave nodes, head back to the master:
Hadoop Configuration:
From the directory $HADOOP_HOME/etc/hadoop we will configure:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
- masters
- slaves
core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas kept for each block across HDFS (the default is 3).</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/mydata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/mydata/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>hadoop-master:50070</value>
    <description>Hostname and port of the primary NameNode for HTTP access.</description>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>hadoop-slave1:50090</value>
    <description>Optional. Set this only if you want the Secondary NameNode to run on a specific node.</description>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
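Note: the stock Hadoop 2.7.1 tarball ships only a template for this file; if mapred-site.xml does not exist yet, create it from the template first:
$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml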
yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
    <description>Hostname of the ResourceManager, so the NodeManagers on the slaves know where to connect.</description>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Auxiliary shuffle service required for MapReduce jobs on YARN.</description>
  </property>
</configuration>
masters
hadoop-master
slaves
hadoop-slave1
hadoop-slave2
hadoop-slave3
Setup passwordless SSH between all nodes:
These commands are executed from the master node; repeat them on all the other nodes as well.
$ ssh-keygen -t rsa
$ for x in hadoop-master hadoop-slave1 hadoop-slave2 hadoop-slave3; do ssh-copy-id $x; done
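To confirm that passwordless SSH works, each iteration of the following loop should print the remote hostname without asking for a password:
$ for x in hadoop-master hadoop-slave1 hadoop-slave2 hadoop-slave3; do ssh $x hostname; done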
Now that we can SSH to all the nodes without a password prompt, let's prepare the directories and sync our data over to the slave nodes:
On each of the slave nodes, prepare the directories:
$ mkdir -p /home/hadoop/mydata/hdfs/datanode
$ sudo mkdir /usr/local/hadoop
$ chown -R hadoop:hadoop /home/hadoop/mydata/hdfs
$ sudo chown -R hadoop:hadoop /usr/local/hadoop
Sync the data from our master node to the slave nodes:
$ for x in hadoop-slave1 hadoop-slave2 hadoop-slave3; do rsync -avxP /usr/local/hadoop/ $x:/usr/local/hadoop/ ; done
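Whenever you change the configuration later on, only the etc/hadoop directory needs to be re-synced, for example:
$ for x in hadoop-slave1 hadoop-slave2 hadoop-slave3; do rsync -avxP /usr/local/hadoop/etc/hadoop/ $x:/usr/local/hadoop/etc/hadoop/ ; done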
Format the NameNode:
$ hdfs namenode -format
Start Hadoop:
Start HDFS first with start-dfs.sh; once that is up, start YARN with start-yarn.sh and then the MapReduce job history server:
$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver
Everything should be up and running. We can verify this by running jps; on the master node we should see:
$ jps
2475 Jps
1174 NameNode
2435 JobHistoryServer
1421 ResourceManager
and from the slave nodes we should see the following (the SecondaryNameNode will only be present on hadoop-slave1):
$ jps
1051 DataNode
1190 NodeManager
2967 Jps
1110 SecondaryNameNode
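You can also have a look at the web interfaces: the NameNode UI on http://hadoop-master:50070 (as configured in hdfs-site.xml) and, assuming the default port, the ResourceManager UI on http://hadoop-master:8088.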
Command-Line Examples:
HDFS: DFSAdmin
Using the HDFS dfsadmin client, we can view basic filesystem information and statistics:
$ hdfs dfsadmin -report
Configured Capacity: 42017480704 (39.13 GB)
Present Capacity: 36992126976 (34.45 GB)
DFS Remaining: 36987867136 (34.45 GB)
DFS Used: 4259840 (4.06 MB)
DFS Used%: 0.01%
Under replicated blocks: 4
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (3):
Name: 172.16.1.11:50010 (hadoop-slave1)
Hostname: hadoop-slave1.ovz.lan.bekkers.co.za
Decommission Status : Normal
Configured Capacity: 21008740352 (19.57 GB)
DFS Used: 667648 (652 KB)
Non DFS Used: 2406580224 (2.24 GB)
DFS Remaining: 18601492480 (17.32 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.54%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon Apr 18 18:08:24 EDT 2016
...
HDFS Safemode:
The NameNode enters safe mode automatically at startup and leaves it automatically once the configured minimum percentage of blocks satisfies the minimum replication condition. Safe mode can also be entered manually, but then it can only be turned off manually as well:
Entering Safemode:
$ hdfs dfsadmin -safemode enter
Safe mode is ON
Leaving Safemode:
$ hdfs dfsadmin -safemode leave
Safe mode is OFF
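You can also query the current safe mode state without changing it:
$ hdfs dfsadmin -safemode get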
YARN
Listing Connected Nodes:
$ yarn node -list
16/04/18 18:19:24 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.1.1:8032
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
hadoop-slave2.ovz.lan.bekkers.co.za:49326 RUNNING hadoop-slave2.ovz.lan.bekkers.co.za:8042 0
hadoop-slave1.ovz.lan.bekkers.co.za:49025 RUNNING hadoop-slave1.ovz.lan.bekkers.co.za:8042
...
Listing Running Applications:
$ yarn application -list
16/04/18 18:22:48 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.1.1:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1461011206069_0006 word count MAPREDUCE hadoop default RUNNING UNDEFINED 5%
Killing an Application:
$ yarn application -kill application_1461011206069_0006
Running a WordCount Application:
$ cd ~/
$ wget -O words.txt https://dl.dropboxusercontent.com/u/31991539/repo/sample-data/hadoop-words.txt
$ hdfs dfs -mkdir -p /user/hadoop/test/in
$ hdfs dfs -copyFromLocal words.txt /user/hadoop/test/in
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount test/in test/out
$ hdfs dfs -cat /user/hadoop/test/out/part-r-00000
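The WordCount output is tab-separated word and count pairs, so a quick way to see the most frequent words (sorting on the second column) is:
$ hdfs dfs -cat /user/hadoop/test/out/part-r-00000 | sort -k2 -nr | head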
Running a Hadoop Streaming Job:
$ hdfs dfs -mkdir -p /user/hadoop/test2/in
$ hdfs dfs -copyFromLocal words.txt /user/hadoop/test2/in
$ hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar -input test2/in -output test2/out -mapper /bin/cat -reducer /usr/bin/wc
$ hdfs dfs -cat /user/hadoop/test2/out/part-00000
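MapReduce refuses to write to an output directory that already exists, so if you want to re-run either job, remove the output directories first:
$ hdfs dfs -rm -r /user/hadoop/test/out /user/hadoop/test2/out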
Next up, we will set up Hive.
Related: Bash Script: Setup a 3 Node Hadoop Cluster on Containers