
This is part 2 of our Big Data Cluster Setup.

In our previous post, we went through the steps of getting your Hadoop cluster up and running.

In this tutorial, we will set up Apache Hive on top of the Hadoop ecosystem.

Our cluster will consist of:

  • Ubuntu 14.04
  • Hadoop 2.7.1
  • HDFS
  • 1 Master Node
  • 3 Slave Nodes

After we have set up Hive, we will also run a Hive example that parses Apache access logs.

This Big Data Series will cover:

  1. Setup Hadoop and HDFS
  2. Setup Hive
  3. Setup Pig
  4. Setup Spark
  5. Example PySpark Application
  6. Example Scala Application (Coming Soon)
  7. Setup Presto (Coming Soon)
  8. Setup Impala (Coming Soon)

Let's get started with our setup:

Setup Hive:

$ su hadoop
$ cd ~/
$ wget http://apache.is.co.za/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
$ tar -xf apache-hive-1.2.2-bin.tar.gz
$ sudo mv apache-hive-1.2.2-bin /usr/lib/hive
$ sudo chown -R hadoop:hadoop /usr/lib/hive
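
To confirm that the extraction and ownership change worked, list the install directory; you should see the usual bin, conf and lib sub-directories (exact contents vary by release):

$ ls -l /usr/lib/hive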

Hive Configuration:

Configure the environment variables. Note that the only difference on RHEL-based operating systems is that we would configure ~/.bashrc instead of ~/.profile:

$ vi ~/.profile
[...]
# Set HIVE HOME
export HIVE_HOME="/usr/lib/hive"
export PATH=$PATH:$HIVE_HOME/bin
[...]

Edit hive-config.sh:

$ vi /usr/lib/hive/bin/hive-config.sh

HIVE_CONF_DIR="${HIVE_CONF_DIR:-$HIVE_HOME/conf}"
export HIVE_CONF_DIR=$HIVE_CONF_DIR
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH
export HADOOP_HOME=/usr/local/hadoop

Source Profile:

$ source ~/.profile
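
As a quick sanity check, confirm that the variables are picked up and that the hive binary is on the PATH:

$ echo $HIVE_HOME
/usr/lib/hive
$ hive --version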

Create Hive Warehouse Directories:

$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /user/hive/warehouse
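
To verify that the warehouse directory exists on HDFS with group write permissions, list it:

$ hdfs dfs -ls /user/hive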

Connecting to Hive:

$ hive
hive>
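
You can also run a statement non-interactively to confirm that Hive starts up cleanly, for example:

$ hive -e 'SHOW DATABASES;'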

Example: Parsing Apache Access Logs:

First we will get some sample data:

$ mkdir ~/apache_logs
$ cd ~/apache_logs
$ wget ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
$ wget ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
$ gunzip NASA_access_log_Jul95.gz
$ gunzip NASA_access_log_Aug95.gz
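
Before creating the table, it helps to glance at the raw data. The logs are in the Apache Common Log Format (host, identity, user, timestamp, request, status and size), which is what the regex in the next step parses; the optional referer and agent groups will simply come back as NULL for these logs:

$ head -n 3 NASA_access_log_Jul95
$ wc -l NASA_access_log_Jul95 NASA_access_log_Aug95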

Connect to Hive and Create the Table:

hive> CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
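
The table uses the RegexSerDe from the hive-contrib module, which ships in $HIVE_HOME/lib. If Hive complains that the SerDe class cannot be found, add the contrib jar to the session explicitly (the jar name below is an assumption, adjust it to the version in your lib directory):

hive> ADD JAR /usr/lib/hive/lib/hive-contrib-1.2.2.jar;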

Once we have created our table, we will load the data that we have downloaded:

hive> LOAD DATA LOCAL INPATH "/home/hadoop/apache_logs/NASA_access_log_Aug95" INTO TABLE serde_regex;

hive> LOAD DATA LOCAL INPATH "/home/hadoop/apache_logs/NASA_access_log_Jul95" INTO TABLE serde_regex;

Now we will query our data:

hive> SELECT * FROM serde_regex ORDER BY time LIMIT 3;
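
With the table populated, you can run ordinary HiveQL over the parsed fields. For example, a simple aggregation counting requests per HTTP status code:

hive> SELECT status, COUNT(*) AS hits FROM serde_regex GROUP BY status ORDER BY hits DESC;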

To run this example as a script, create hive-parser.hql:

$ vi hive-parser.hql

CREATE TABLE serde_regex(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "/home/hadoop/apache_logs/NASA_access_log_Aug95" INTO TABLE serde_regex;
LOAD DATA LOCAL INPATH "/home/hadoop/apache_logs/NASA_access_log_Jul95" INTO TABLE serde_regex;
SELECT * FROM serde_regex ORDER BY time;

To run the Hive Script, we execute:

$ hive -f hive-parser.hql
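
Note that the script recreates the table, so it will fail if serde_regex already exists (for example because you ran the interactive steps above). To make it rerunnable, drop the table first or change the script to use CREATE TABLE IF NOT EXISTS:

$ hive -e 'DROP TABLE IF EXISTS serde_regex;'
$ hive -f hive-parser.hql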

Setup External Metastore:

You can use a centralized MySQL server to host the Hive metastore. To enable this, we continue with the following setup.

To connect to a MySQL server, we need to install the Java MySQL connector:

$ sudo apt-get install libmysql-java -y

Copy the connector over to your Hive lib directory:

$ sudo cp /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
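
A quick check that the jar is now in the Hive lib directory:

$ ls -l /usr/lib/hive/lib/mysql-connector-java.jar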

Connect to your MySQL server and load the Hive schema SQL script:

$ mysql -h hostname.domain.com -u root -p
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-1.2.0.mysql.sql

Create the user that will connect to the MySQL server. In this case we use the user hive with the password hive:

mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
mysql> GRANT ALL ON metastore.* TO 'hive'@'ip-hadoop-master' IDENTIFIED BY 'hive';
mysql> FLUSH PRIVILEGES;
mysql> exit; 
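
Before pointing Hive at the new metastore, it is worth checking that the hive user can reach it from the Hadoop master (hostname.domain.com is the same placeholder as above, replace it with your MySQL host):

$ mysql -h hostname.domain.com -u hive -p metastore -e 'SHOW TABLES;'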

Create the hive-site.xml under the Hive conf directory:

$ vi /usr/lib/hive/conf/hive-site.xml
<configuration>

   <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://ip-mysql-server/metastore?createDatabaseIfNotExist=true</value>
      <description>metadata is stored in a MySQL server</description>
   </property>

   <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>MySQL JDBC driver class</description>
   </property>

   <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
      <description>MySQL User</description>
   </property>

   <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive</value>
      <description>MySQL Password</description>
   </property>
</configuration>
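
With hive-site.xml in place, you can verify that Hive is talking to the MySQL metastore. Assuming the schematool utility shipped with your Hive version, it will print the configured connection URL and schema version; running a statement through Hive afterwards confirms the connection works end to end:

$ schematool -dbType mysql -info
$ hive -e 'SHOW DATABASES;'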

That concludes the setup of Hive on our Hadoop cluster.