This is part 3 of our Big Data Cluster Setup.
In our previous post, we went through the steps of getting your Hadoop cluster up and running.
In this tutorial, we will set up Apache Spark on top of the Hadoop ecosystem.
Our cluster will consist of:
- Ubuntu 14.04
- Hadoop 2.7.1
- HDFS
- 1 Master Node
- 3 Slave Nodes
After we have set up our Spark cluster, we will run the SparkPi example. Please also have a look at the PySpark and Scala example applications, where we go step by step through generating some sample data and filtering it for patterns.
This Big Data Series will cover:
- Setup Hadoop and HDFS
- Setup Hive
- Setup Pig
- Setup Spark
- Example PySpark Application
- Example Scala Application (Coming Soon)
- Setup Presto (Coming Soon)
- Setup Impala (Coming Soon)
Let's get started with our setup:
Setup Scala:
$ sudo su hadoop
$ cd ~/
$ wget http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
$ tar -xvf scala-2.11.8.tgz
$ sudo mv scala-2.11.8 /usr/local/scala
$ sudo chown -R hadoop:hadoop /usr/local/scala
Setup Spark:
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
$ tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo cp -r spark-1.6.1-bin-hadoop2.6 /usr/local/spark
$ sudo chown -R hadoop:hadoop /usr/local/spark
Spark Configuration:
Configure Spark Environment Variables:
$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh
Edit spark-env.sh according to your needs; more information on the available settings can be found in the Spark configuration documentation.
$ vi spark-env.sh
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"   # so Spark picks up the Hadoop/YARN configuration
SPARK_YARN_QUEUE="default"                       # YARN queue that Spark applications are submitted to
Source Spark Profile:
$ source spark-env.sh
Configure Spark Slaves:
$ cd /usr/local/spark/conf
$ mv slaves.template slaves
$ vi slaves
hadoop-slave1
hadoop-slave2
hadoop-slave3
Configure Paths:
$ vi ~/.profile
# Scala
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH
Source Profile:
$ source ~/.profile
Repeat Installation on Slave Nodes:
Once Spark is also installed on the slave nodes of your Hadoop cluster, continue on the master with:
Start Spark Cluster:
$ cd /usr/local/spark/sbin
$ ./stop-all.sh
$ ./start-all.sh
Interacting with Spark:
The Spark shell is available in Scala and Python for interactive use.
For Python, we have pyspark:
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>>
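As a quick sanity check from the pyspark prompt, here is a minimal sketch of the kind of thing the PySpark example application expands on: parallelize some sample data and filter it for a pattern. It relies only on the sc SparkContext that the shell creates for you; the numbers are made-up sample data.

```python
# Minimal sketch for the pyspark shell: build a small RDD of sample numbers
# and keep only the even ones. sc is the SparkContext the shell provides.
data = sc.parallelize(range(1, 11))        # sample data: 1..10
evens = data.filter(lambda n: n % 2 == 0)  # filter for a simple pattern
print(evens.collect())                     # [2, 4, 6, 8, 10]
```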
For Scala, we have spark-shell:
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.7.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Submit work to the cluster with spark-submit:
```bash
$ spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]
```

For example, you can submit your Spark job to YARN:

```bash
$ spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 1G \
  --conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py
```
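Note that wordcount.py above refers to your own application script and is not part of this post. Purely as an illustration, a minimal sketch of what such a script might look like is shown below; the HDFS input and output paths are placeholders you would replace with your own.

```python
# wordcount.py - illustrative sketch of a word count job (paths are placeholders).
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Read a text file from HDFS, split each line into words, and count each word.
counts = (sc.textFile("hdfs:///user/hadoop/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///user/hadoop/wordcount-output")
sc.stop()
```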
Running a Spark Sample Program:
Spark comes with a couple of sample programs; we will run the SparkPi example:
$ cd $SPARK_HOME
$ bin/run-example SparkPi 10
Pi is roughly 3.13918
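If you are curious what SparkPi does, the snippet below is a rough PySpark sketch of the same Monte Carlo estimate, intended for the pyspark shell (it again relies on the shell's sc). It is only an illustration, not the bundled example itself.

```python
# Rough sketch of the Monte Carlo Pi estimate that SparkPi performs.
# Intended for the pyspark shell, where sc is already defined.
import random

def inside(_):
    # Sample a random point in the unit square; check if it falls in the quarter circle.
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

samples = 100000
count = sc.parallelize(range(samples)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / samples))
```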
Examples of PySpark and Scala Apps:
Examples of PySpark and Scala applications can be found at the links.