
Setup Spark Cluster on Hadoop YARN


This is part 3 of our Big Data Cluster Setup.

In our previous post, we went through the steps of getting your Hadoop cluster up and running.

In this tutorial, we will set up Apache Spark on top of the Hadoop ecosystem.

Our cluster will consist of:

  • Ubuntu 14.04
  • Hadoop 2.7.1
  • HDFS
  • 1 Master Node
  • 3 Slave Nodes

After we have set up our Spark cluster, we will also run a SparkPi example. Please also have a look at the example applications in PySpark and Scala, where we go step by step through generating some sample data and filtering it for patterns.

This Big Data Series will cover:

  1. Setup Hadoop and HDFS
  2. Setup Hive
  3. Setup Pig
  4. Setup Spark
  5. Example PySpark Application
  6. Example Scala Application (Coming Soon)
  7. Setup Presto (Coming Soon)
  8. Setup Impala (Coming Soon)

Let's get started with our setup:

**Setup Scala:**

```language-bash
$ sudo su hadoop
$ cd ~/
$ wget http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
$ tar -xvf scala-2.11.8.tgz
$ sudo mv scala-2.11.8 /usr/local/scala
$ sudo chown -R hadoop:hadoop /usr/local/scala
```
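
To confirm that the Scala install is usable, you can call the binary directly (it is not on the PATH yet; we will add it to the profile further down):

```language-bash
$ /usr/local/scala/bin/scala -version
# should report: Scala code runner version 2.11.8 ...
```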


**Setup Spark:**

```language-bash
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
$ tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
$ sudo cp -r spark-1.6.1-bin-hadoop2.6 /usr/local/spark
$ sudo chown -R hadoop:hadoop /usr/local/spark
```
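
Before configuring anything, a quick sanity check that the unpacked Spark build runs is to print its version (this works without any cluster configuration in place yet):

```language-bash
$ /usr/local/spark/bin/spark-submit --version
```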

**Spark Configuration:**

Configure Spark Environment Variables:

```language-bash
$ cd /usr/local/spark/conf
$ cp spark-env.sh.template spark-env.sh
```

Edit `spark-env.sh` according to your needs; more information on the available environment variables can be found **[here](http://spark.apache.org/docs/latest/configuration.html#environment-variables)**.

```language-bash
$ vi spark-env.sh
```

```language-bash
HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"
SPARK_YARN_QUEUE="default"  
```
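
If you also want to control the resources the standalone workers register with, `spark-env.sh` accepts settings such as the ones below. These values, and the `hadoop-master` hostname, are only examples for this sketch; adjust them to your own nodes:

```language-bash
SPARK_MASTER_IP="hadoop-master"   # hostname/IP the standalone master binds to (example)
SPARK_WORKER_CORES=2              # cores each worker offers to Spark (example)
SPARK_WORKER_MEMORY=2g            # memory each worker offers to Spark (example)
```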

Source Spark Profile:

```language-bash
$ source spark-env.sh
```

Configure Spark Slaves:

```language-bash
$ cd /usr/local/spark/conf
$ mv slaves.template slaves
$ vi slaves
```

```language-bash
hadoop-slave1
hadoop-slave2
hadoop-slave3
```
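
The `start-all.sh` script we run later uses ssh to start a worker on each of these hosts, so the `hadoop` user needs passwordless ssh to every slave (this should already be in place from the Hadoop setup). A quick check:

```language-bash
$ ssh hadoop-slave1 hostname
$ ssh hadoop-slave2 hostname
$ ssh hadoop-slave3 hostname
```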

Configure Paths:

```language-bash
$ vi ~/.profile
```

```language-bash
# Scala
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
# Spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH
```

Source Profile:

```language-bash
$ source ~/.profile
```
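
At this point both tools should resolve from your PATH; a quick check that the environment variables took effect:

```language-bash
$ echo $SCALA_HOME $SPARK_HOME
$ scala -version
$ spark-submit --version
```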


**Repeat Installation on Slave Nodes:**

Scala and Spark also need to be installed on each of the slave nodes of your Hadoop cluster.
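
Rather than repeating the downloads on every node, one option is to copy the installation from the master. This is only a sketch: it assumes the `hadoop` user can write to `/usr/local` on each slave (create and `chown` the directories with sudo first if not) and uses the hostnames from our `slaves` file:

```language-bash
$ for host in hadoop-slave1 hadoop-slave2 hadoop-slave3; do
>   scp -rq /usr/local/scala "$host":/usr/local/
>   scp -rq /usr/local/spark "$host":/usr/local/
>   scp -q ~/.profile "$host":~/
> done
```

Once Spark is in place on all the slave nodes, continue on the master: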


**Start Spark Cluster:**

```language-bash
$ cd /usr/local/spark/sbin
$ ./stop-all.sh
$ ./start-all.sh
```
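
`start-all.sh` starts a standalone Master process on this node and a Worker on every host listed in the `slaves` file. A quick way to verify that everything came up is to check the Java processes on the master and on a slave:

```language-bash
$ jps                     # on the master: a "Master" process should be listed
$ ssh hadoop-slave1 jps   # on each slave: a "Worker" process should be listed
```

The standalone master also serves a web UI, by default on port 8080, e.g. `http://hadoop-master:8080` (replace with your master's hostname).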

**Interacting with Spark:**

The Spark shell is available in both Scala and Python for interactive use. To start them:

For Python, we have **`pyspark`**:

```language-bash
$ pyspark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
         

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> 

```

For Scala, we have **`spark-shell`**:

```language-bash
$ spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.7.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
```

Submit work to the Cluster with **`spark-submit`**:

```language-bash
$ spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  <application-jar> \
  [application-arguments]
```

For example, you can submit your Spark job to YARN:

```language-bash
$ spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 1G \
  --conf spark.yarn.submit.waitAppCompletion=false \
  wordcount.py
```
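
Because `spark.yarn.submit.waitAppCompletion=false` makes `spark-submit` return as soon as the application is submitted, you can track the job with the YARN CLI instead (the application id below is a placeholder; use the one reported for your job):

```language-bash
$ yarn application -list
$ yarn logs -applicationId <application-id>
```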

**Running a Spark Sample Program:**

Spark ships with a couple of sample programs; we will run the SparkPi sample:


```language-bash
$ cd $SPARK_HOME
$ bin/run-example SparkPi 10
```

```language-bash
Pi is roughly 3.13918
```
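
`run-example` runs SparkPi locally by default. To run the same example on YARN instead, you can submit the bundled examples jar with `spark-submit`; the jar name below is the one shipped with the Spark 1.6.1 / Hadoop 2.6 build we downloaded, so adjust it if your build differs:

```language-bash
$ spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10
```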

**Example PySpark and Scala Apps:**

Example **[PySpark](https://sysadmins.co.za/spark-pyspark-examples/)** and **[Scala]()** applications can be found by following the links (the Scala example post is coming soon).