This is part 4 of our Big Data Cluster Setup.

From our Previous Post I was going through the steps on setting up Spark on your Hadoop Cluster.

In this tutorial, we will setup Apache Pig, on top of the Hadoop Ecosystem.

Our cluster will consist of:

  • Ubuntu 14.04
  • Hadoop 2.7.1
  • 1 Master Node
  • 3 Slave Nodes

After we have setup Pig we will also run a Pig example.

This Big Data Series will cover:

  1. Setup Hadoop and HDFS
  2. Setup Hive
  3. Setup Spark
  4. Setup Pig
  5. Example PySpark Application
  6. Example Scala Application (Coming Soon)
  • Setup Hive and Pig (Coming Soon)
  • Setup Presto (Coming Soon)
  • Setup Impala (Coming Soon)

Lets get started with our setup:

Setup Pig:

$ wget http://apache.claz.org/pig/pig-0.15.0/pig-0.15.0.tar.gz
$ wget http://apache.claz.org/pig/pig-0.15.0/pig-0.15.0-src.tar.gz
$ mkdir /usr/local/pig
$ sudo chown hadoop:hadoop /usr/local/pig -R
$ tar -xf pig-0.15.0.tar.gz -C /usr/local/pig/
$ tar -xf pig-0.15.0-src.tar.gz -C /usr/local/pig/
``` <p>

**Setup Environment Variables:**

vi ~/.profile
``` <p>

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
``` <p>

**Pig Properties:**

View supported porperties for the config /usr/local/pig/conf/pig.properties bu executing:

$ pig -h properties
``` <p>

**Verify Installation:**

pig –version
``` <p>

**Basic Pig Script:**

Let's test our installation by running a very basic pig script.

Generate the data:

cat > file.txt << EOF
``` <p>

Push the data to HDFS:

hdfs dfs -mkdir -p /pig/input
hdfs dfs -copyFromLocal file.txt /pig/input/
``` <p>

Create the Pig script:

cat > script.pig << EOF
dsetLOAD = LOAD '/pig/input/' USING PigStorage(',') AS (id:int, name:chararray, surname:chararray, location:chararray);
dsetGENERATE = FOREACH dsetLOAD GENERATE id, name, surname, location;
``` <p>

==Note==: You can run the Pig script in 3 modes:

$ pig -x local file.pig
``` <p>

$ pig file.pig
``` <p>


$ pig -x tez file.pig
``` <p>

**Executing the Script:**

We will run our script in HDFS mode since our data resides in HDFS:

$ pig script.pig
``` <p>