
This is part 3 of our Big Data Cluster Setup.

In our previous post, we went through the steps of getting your Hadoop cluster up and running.

In this tutorial, we will set up Apache Spark on top of the Hadoop ecosystem.

Our cluster will consist of:

  • Ubuntu 14.04
  • Hadoop 2.7.1
  • HDFS
  • 1 Master Node
  • 3 Slave Nodes

After we have set up our Spark cluster, we will run a SparkPi example. Please also have a look at the example applications on PySpark and Scala, where we go step by step through generating some sample data and filtering it for patterns.

This Big Data Series will cover:

  1. Setup Hadoop and HDFS
  2. Setup Hive
  3. Setup Pig
  4. Setup Spark
  5. Example PySpark Application
  6. Example Scala Application (Coming Soon)
  7. Setup Presto (Coming Soon)
  8. Setup Impala (Coming Soon)

Let's get started with our setup:

Setup Scala:

$ sudo su hadoop
$ cd ~/
$ wget http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
$ tar -xvf scala-2.11.8.tgz
$ sudo mv scala-2.11.8 /usr/local/scala
$ sudo chown -R hadoop:hadoop /usr/local/scala
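
To make sure the Scala install is usable, you can print its version (Scala is not on the PATH yet, so use the full path):

$ /usr/local/scala/bin/scala -version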

Setup Spark:

$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
$ tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo cp -r spark-1.6.1-bin-hadoop2.6 /usr/local/spark
$ sudo chown -R hadoop:hadoop /usr/local/spark
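
Similarly, a quick check that the Spark binaries are in place (SPARK_HOME is not on the PATH yet, so use the full path):

$ /usr/local/spark/bin/spark-submit --version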

Spark Configuration:

Configure Spark Environment Variables:

$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh

Edit spark-env.sh according to your needs; more information on the definitions can be found here:

$ vi spark-env.sh
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="default"  
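Depending on the size of your nodes, you may also want to limit the resources each worker offers; the values below are only illustrative assumptions, adjust them for your machines:

SPARK_WORKER_CORES="2"
SPARK_WORKER_MEMORY="2g"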

Source Spark Profile:

$ source spark-env.sh

Configure Spark Slaves:

$ cd /usr/local/spark/conf
$ mv slaves.template slaves
$ vi slaves
hadoop-slave1
hadoop-slave2
hadoop-slave3

Configure Paths:

$ vi ~/.profile
# Scala
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH

Source Profile:

$ source ~/.profile
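
As a quick sanity check that the new PATH entries took effect, each command should print the path under /usr/local:

$ which scala
/usr/local/scala/bin/scala
$ which spark-submit
/usr/local/spark/bin/spark-submit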

Repeat Installation on Slave Nodes:

Repeat the Scala and Spark installation steps above on each slave node, or copy the installed directories over from the master, as in the sketch below.
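
A minimal sketch, assuming passwordless SSH as the hadoop user to hadoop-slave1 through hadoop-slave3 and that this user may write to /usr/local on the slaves (otherwise copy to a temporary location and move it with sudo):

$ for host in hadoop-slave1 hadoop-slave2 hadoop-slave3; do
>   rsync -a /usr/local/scala/ $host:/usr/local/scala/
>   rsync -a /usr/local/spark/ $host:/usr/local/spark/
>   rsync -a ~/.profile $host:
> done

Once Spark is also installed on all slave nodes of your Hadoop cluster, continue on the master: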

Start Spark Cluster:

$ cd /usr/local/spark/sbin
$ ./stop-all.sh
$ ./start-all.sh
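
To verify the cluster came up, check the running Java processes with jps; assuming the master's hostname is hadoop-master (as in our Hadoop setup), the standalone master's web UI should also be reachable at http://hadoop-master:8080.

$ jps                      # on the master, a "Master" process should be listed
$ ssh hadoop-slave1 jps    # each slave should list a "Worker" process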

Interacting with Spark:

The Spark shell is available in Scala and Python for interactive usage. To start them:

For Python, we have pyspark:

$ pyspark

Welcome to
       ___               __
     / __/__  ___  _ ___/ /__
    _\ \/ _ \/ _ `/ __/  `_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
         

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> 

For Scala, we have spark-shell:

$ spark-shell

Welcome to
       ___               __
     / __/__  ___  _ ___/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.7.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

Submit work to the Cluster with spark-submit:

$ spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]

For example, you can submit your Spark job to YARN:

$ spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 1G \
  --conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py
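
Since waitAppCompletion is set to false, spark-submit returns right away; you can follow the job with the standard YARN CLI (yarn logs requires log aggregation to be enabled):

$ yarn application -list
$ yarn logs -applicationId <application-id>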

Running a Spark Sample Program:

Spark comes with a couple of sample programs; we will run the SparkPi sample:

$ cd $SPARK_HOME
$ bin/run-example SparkPi 10
Pi is roughly 3.13918
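
You can also submit the same example to YARN with spark-submit; a quick sketch, assuming the bundled examples jar lives under lib/ of the Spark download (the exact file name may differ):

$ spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  lib/spark-examples*.jar 10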

Examples of PySpark and Scala Apps:

Examples of PySpark and Scala applications can be found in the linked posts of this series.