Student Seminar Presentation on using Hadoop for Big Data
Ken Lau
April 9, 2015
1. Overview
In this student seminar, I go over the fundamentals of Hadoop and map-reduce, and demonstrate how one can wrangle data using a distributed framework and file system. The style of this seminar is similar to the other programming-related seminars: the first 15 minutes give an overview of the topic, and the last 30 minutes are an interactive component. A distributed computing environment is most useful when the data are so large that they cannot be processed on a single machine. The slides for the presentation part of the seminar can be viewed here.
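To make the map-reduce idea concrete before the interactive part, here is a toy sketch in plain Python with no Hadoop involved. The data and names are made up for illustration only: a word count expressed as a map step, a shuffle that groups pairs by key, and a reduce step.

from itertools import groupby

lines = ["big data is big", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (the word), as Hadoop does between map and reduce.
mapped.sort(key=lambda kv: kv[0])
grouped = groupby(mapped, key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = {word: sum(v for _, v in pairs) for word, pairs in grouped}
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'processes': 1}

Hadoop performs exactly these stages, except that the map and reduce steps run in parallel on different machines and the data live in a distributed file system.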
Now turning our attention to the interactive part: anyone using Windows should consider downloading Virtual Box and installing Linux in it. You can follow the instructions under "Preparation for the seminar" below. In short, I have provided a link to a Linux image with Ubuntu 12.04, and it should work. Also, sign up for an Amazon Web Services (AWS) account at http://aws.amazon.com/ if you have not already done so. The following instructions guide you through starting up your own cluster. The majority of the material is adapted from this Introduction to Data Science course; if you do any part of it, I highly recommend the section on Amazon Web Services.
2. Preparation for the seminar
Virtual machine (VM) preparation:
If you are using Linux or a Mac, then you should be good to go. However, if you are using Windows, then you should run Linux in a virtual machine (VM) using Virtual Box. Downloading Virtual Box does not give you Linux; you need to download a Linux image separately. The following is a download link to a Linux Ubuntu 12.04 32-bit image that will get you running Linux on Virtual Box. Once you have downloaded this file, unzip it. You should then see a file ending in .vmdk in the same directory.
Setting up Linux on Virtual Box
- Start up Virtual Box.
- Click on "New" on the upper left corner of the interface.
- In the "Name:" field, enter any name you want such as "Hadoop"
- In the "Type:" field, choose Linux.
- In the "Version:" field, choose Ubuntu.
- Press "Next"
- Choose a memory size, for example 1024 MB, and then press "Next"
- Click on "Use an existing virtual hard drive file", browse to your .vmdk Linux image file, choose it, and then click "Create"
Sign up for an AWS account
Sign up for your AWS account.
3. Tutorial component of seminar
AWS Preparation
- Sign into AWS
- At the top right you should see your account name; click on the drop-down menu beside it and select "US West (Oregon)".
- Go to the main Amazon Web Services page; if you are not sure how to get there, click on the cube-looking object at the top left of the screen.
- Look for the name "EC2" under the Compute category and click on it.
- Scroll down the left side bar panel to find "Key Pairs" under the Network and Security category, and then click on "Key Pairs".
- Click "Create Key Pair" button at the top.
- Type in a key pair name; "seminar-key" is fine.
- Download the "seminar-key.pem" file and store it somewhere convenient. I have saved it in /home/user/tutorial/seminar-key.pem
- Navigate to the folder containing the .pem file, and execute the following command: "chmod 600 seminar-key.pem". This step ensures that you are the only one who can access this file.
Creating the cluster
- Go back to the home page by clicking on the cube object at the top left.
- Click on "EMR" under the Analytics category.
- Click on "Create Cluster".
- Enter a name in the "Cluster name" field, for example "tutorial".
- Uncheck the "Logging" box.
- In the Software Configuration section, set the AMI version to 2.4.2 and remove Hive and Pig under "Applications to be installed".
- Under Hardware Configuration, choose the Master node to be m1.medium under "General Purpose (Previous Generation)" and the Core node to be m1.medium or m1.small.
- Under "Security and Access", select the EC2 key pair you created earlier, for example "seminar-key".
- Scroll down to the very bottom and click "Create Cluster"
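If you would rather script the cluster creation than click through the console, a rough equivalent using the boto3 Python library is sketched below. This is an optional aside, not part of the seminar: it assumes boto3 is installed and configured with your AWS credentials, that the default EMR IAM roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in your account, and it mirrors the console settings above (AMI 2.4.2, m1.medium instances, the seminar-key key pair).

import boto3

# Talk to EMR in the same region selected in the console (US West, Oregon).
emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="tutorial",
    AmiVersion="2.4.2",  # same AMI version as chosen in the console
    Instances={
        "MasterInstanceType": "m1.medium",
        "SlaveInstanceType": "m1.medium",
        "InstanceCount": 2,  # one master node, one core node
        "Ec2KeyName": "seminar-key",
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up so we can ssh in
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # assumes the default EMR roles exist
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])

Either way, the result is the same kind of cluster; the console route above is what we will use in the seminar.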
Connecting to the cluster
- Start up a terminal
- Navigate to the directory with the seminar-key.pem file
- Run the following command, where "ec2-XX-XX-XXX-XXX..." is the name of the server you are going to connect to. It can be found in the "Master Public DNS" field on the page where you created your cluster, or under "Cluster List".
ssh -o "ServerAliveInterval 10" -i seminar-key.pem hadoop@ec2-XX-XX-XXX-XXX.us-west-2.compute.amazonaws.com
Running map-reduce
- Once you have connected successfully, clone the following repository to get the data and scripts for this tutorial by using the following command:
git clone https://github.com/kenlau177/Student-Seminar
- Navigate to the directory of the cloned repository
- Create a directory in the Hadoop Distributed File System (HDFS) by executing:
hadoop fs -mkdir /user/hadoop
- Create an input directory in HDFS:
hadoop fs -mkdir /user/hadoop/myinput
- Put stream-data.txt into the input directory in HDFS:
hadoop fs -put stream-data.txt /user/hadoop/myinput
- Run the map-reduce job (a sketch of what streaming mapper and reducer scripts look like follows after these steps):
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming-1.0.3.jar -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py -input /user/hadoop/myinput -output /user/hadoop/joboutput
- Copy the output file from HDFS to the server's local file system:
hadoop fs -copyToLocal /user/hadoop/joboutput/part-00000 .
- Copy the results from the server to your local computer by running the following command from your local computer (adjust the remote path to wherever you copied part-00000 on the server):
scp -o "ServerAliveInterval 10" -i seminar-key.pem hadoop@ec2-XX-XX-XXX-XXX.us-west-2.compute.amazonaws.com:~/Student-Seminar/part-00000 .