CSC 555 and DSC 333 Mining Big Data Assignment 2 solution





Reading: Mining of Massive Datasets: Chapter 2.
Hadoop: The Definitive Guide: Chapter 17 (Hive), Appendix A (file also available on D2L).
Supplemental document UsingAmazonAWS.doc and Instructions_ReformatHDFS_Hive.doc
The reformatting instructions are included in case you have to re-initialize your AWS instance; you would only need them if you have to reformat your Hadoop setup.
Part 1
1) Describe how you would implement a MapReduce job consisting of a Map and a Reduce description. You can describe it in your own words or as pseudo-code. Keep in mind that the map task reads the input file and produces (key, value) pairs, and the reduce task takes a list of (key, value) pairs for each key and combines all values for that key.
Remember that Map operates on individual blocks and Reduce on individual keys with a set of values. Thus, for Mapper you need to state what your code does given a block of data and for Reduce you need to state what your reducer does for each key. You can assume that all of the columns accessed by the query exist in the original table.

a) SELECT Last, MIN(Grade)
FROM Student;

b) SELECT …
FROM Student
GROUP BY City, State;
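As a toy illustration of the map → shuffle → reduce flow described above, here is an in-memory Python sketch (the Student rows are invented for this example; it computes MIN(Grade) per Last, in the spirit of query a):

```python
from collections import defaultdict

# Hypothetical Student rows (Last, Grade), split into two HDFS-style "blocks".
blocks = [
    [("Smith", 3.1), ("Lee", 3.9)],   # block 1
    [("Smith", 2.7), ("Chen", 3.5)],  # block 2
]

def mapper(block):
    # Map runs once per block and emits (key, value) pairs.
    for last, grade in block:
        yield (last, grade)

def reducer(key, values):
    # Reduce sees each key once, with the list of all of its values.
    return (key, min(values))

# Shuffle: group every emitted value by its key across all blocks.
groups = defaultdict(list)
for block in blocks:
    for key, value in mapper(block):
        groups[key].append(value)

results = dict(reducer(k, v) for k, v in groups.items())
print(results)  # {'Smith': 2.7, 'Lee': 3.9, 'Chen': 3.5}
```

The same shape works for any aggregate: only the reducer body changes.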

2) Suppose you are tasked with analysis of the company’s web server logs. The log dump contains a large amount of information with up to 7 different attributes (columns). You regularly run a Hadoop job to perform analysis pertaining to 3 specific attributes – TimeOfAccess, OriginOfAccess and FileName out of 7 total in the file.

a) How would you attempt to speed up the regular execution of the query? (2-a is intentionally an open-ended question, there are several acceptable answers)

b) If a Mapper task fails while processing a block of data – which node(s) would be preferred to restart it?

c) If the job is executed with 2 Reducers
i) How many files does the output generate?
ii) Suggest one possible hash function that may be used to assign keys to reducers.
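One common choice for (ii), sketched in Python under the assumption of string keys and 2 reducers: hash the key and take the result modulo the number of reducers.

```python
import hashlib

NUM_REDUCERS = 2

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash the key deterministically and take it modulo the number of
    # reducers; every occurrence of a key then goes to the same reducer.
    # (hashlib is used here because Python's built-in hash() of strings
    # is randomized between runs.)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

for key in ["TimeOfAccess", "OriginOfAccess", "FileName"]:
    print(key, "-> reducer", partition(key))
```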

3) Consider a Hadoop job that processes an input data file of size equal to 65 disk blocks (65 different blocks, you can assume that HDFS replication factor is set to 1). The mapper in this job requires 1 minute to read and process a single block of data. For the purposes of this assignment, you can assume that the reduce part of this job takes zero time. You can also refer to the supplemental example on how to make this estimate.

a) Approximately how long will it take to process the file if you only had one Hadoop worker node? You can assume that only one mapper is created on every node.
b) 10 Hadoop worker nodes?
c) 30 Hadoop worker nodes?
d) 100 Hadoop worker nodes?

e) Now suppose you were told that the replication factor has been changed to 3. That is, each block is stored in triplicate, but file size is still 65 blocks. Which of the answers (if any) in a)-d) above will have to change?

You can ignore the network transfer costs and other potential overheads as well as the possibility of node failure. State any assumptions you make.
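The kind of estimate asked for in 3 a)–d) boils down to counting map "waves": with one mapper per node, each wave processes as many blocks as there are workers. A sketch with made-up numbers (not the assignment's):

```python
import math

def job_time_minutes(num_blocks, num_workers, minutes_per_block=1.0,
                     mappers_per_node=1):
    # Each "wave" processes (workers * mappers_per_node) blocks in
    # parallel, so the map phase takes ceil(blocks / slots) waves.
    slots = num_workers * mappers_per_node
    waves = math.ceil(num_blocks / slots)
    return waves * minutes_per_block

# Made-up illustration: 10 blocks on 4 single-mapper workers
# -> ceil(10 / 4) = 3 waves -> 3 minutes.
print(job_time_minutes(10, 4))  # 3.0
```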

Part 2: Linux Intro

This part of the assignment will serve as an introduction to Linux. Make sure you go through the steps below and submit screenshots where requested – submit the entire screenshot of a command terminal.

Use at least a t2.small instance or Hadoop may not run properly with insufficient memory.
All Linux commands are in Berlin Sans FB. Do not type the “$” symbol. The “$” represents the prompt “[ec2-user@ip-xxx-xx-xx-xxx ~] $ ” in your particular Linux instance.

0. Login to your Amazon EC2 Instance (NOTE: instructions on how to create a new instance and log in to it are provided in a separate file, UsingAmazonAWS.doc)

SUBMIT: The name of the instance that you have created.

Connect to your instance through PuTTy or a Mac terminal.

On Windows, your instance would look similar to the following image. On a Mac, you would instead get the same text in an XTerm terminal:

1. Create a text file.
Instructions for 3 different text editors are provided below. You only need to choose the one editor that you prefer. nano is a more basic text editor and much easier to start with. vim and emacs are more advanced, rely heavily on keyboard shortcuts, and thus have a steeper learning curve.
• Tip: To paste into Linux terminal you can use right-click. To copy from the Linux terminal, you only need to highlight the text that you want to copy with your mouse. Also please remember that Linux is case-sensitive, which means Nano and nano are not equivalent.

Nano Instructions(Option 1):

$ nano myfile.txt

Type something into the file: “This is my text file for CSC555.”

Save changes: Ctrl-o and hit Enter.
Exit: Ctrl-x

Emacs Instructions (Option 2):

You will need to install emacs.
$ sudo yum install emacs
Type “y” when asked if this is OK to install.

$ emacs myfile.txt
Type something into the file: “This is my text file for CSC555.”
Save changes: Ctrl-x, Ctrl-s
Exit: Ctrl-x Ctrl-c (note that Ctrl-x Ctrl-z suspends emacs instead of exiting it)

Vim Instructions(Option 3):
• NOTE: When vim opens, you are in command mode. Any key you enter will be bound to a command instead of inserted into the file. To enter insert mode press the key “i”. To save a file or exit you will need to hit Esc to get back into command mode.

$ vim myfile.txt
Type “i” to enter insert mode
Type something into the file: “This is my text file for CSC555.”
Save changes: hit Esc to enter command mode then type “:w”
Exit: (still in command mode) type “:x”

Confirm your file has been saved by listing the files in the working directory.
$ ls
You should see your file.
Display the contents of the file on the screen.
$ cat myfile.txt

Your file contents should be printed to the terminal.

• Tip: Linux will fill in partially typed commands if you hit Tab.
$ cat myfi
Hit Tab and “myfi” should be completed to “myfile.txt”. If there are multiple completion options, hit Tab twice and a list of all possible completions will be printed. This also
applies to commands themselves, i.e. you can type in ca and see all possible commands that begin with ca.

2. Copy your file.

Make a copy.
$ cp myfile.txt mycopy.txt
Confirm this file has been created by listing the files in the working directory.
Edit this file so it contains different text than the original file using the text editor instructions, and confirm your changes by displaying the contents of the file on the screen.

SUBMIT: Take a screen shot of the contents of your copied file displayed on the terminal screen.

3. Delete a file

Make a copy to delete.
$ cp myfile.txt filetodelete.txt
$ ls

Remove the file.
$ rm filetodelete.txt
$ ls

4. Create a directory to put your files.

Make a directory.
$ mkdir CSC555

Change the current directory to your new directory.
$ cd CSC555

Print your current working directory.
$ pwd

5. Move your files to your new directory.

Return to your home directory.
$ cd ..
$ cd /home/ec2-user/

• NOTE: cd will always take you to your home directory. cd .. will move you up one directory level (to the parent). Your home directory is “/home/[user name]”, /home/ec2-user in our case

Move your files to your new directory.

$ mv myfile.txt CSC555/
$ mv mycopy.txt CSC555/

Change the current directory to CSC555 and list the files in this directory.

SUBMIT: Take a screen shot of the files listed in the CSC555 directory.

6. Zip and Unzip your files.

Zip the files.
$ zip myzipfile mycopy.txt myfile.txt
$ zip myzipfile *

• NOTE: * is the wildcard symbol that matches everything in current directory. If there should be any additional files in the current directory, they will also be placed into the zip archive. Wildcard can also be used to match files selectively. For example zip myzipfile my* will zip-up all files beginning with “my” in the current directory.

Move your zip file to your home directory.
$ mv myzipfile.zip /home/ec2-user/

Return to your home directory.
Extract the files.
$ unzip myzipfile.zip

SUBMIT: Take a screen shot of the screen after this command.

7. Remove your CSC555 directory.

• Warning: Executing “rm -rf” has the potential to delete ALL files in a given directory, including sub-directories (“r” stands for recursive). You should use this command very carefully.
Delete your CSC555 directory.
$ rm -rf CSC555/

8. Download a file from the web.

Download the script for Monty Python and the Holy Grail.
$ wget

The file should be saved as “grail” by default.

9. ls formats

List all contents of the current directory in long list format.
• Note: the option following “ls” is the character “l”; not “one”.
$ ls -l

The 1st column gives information regarding file permissions (which we will discuss in more detail later). For now, note that the first character of the 10 total will be “-“ for normal files and “d” for directories. The 2nd Column is the number of links to the file. The 3rd and 4th columns are the owner and the group of the file. The 5th column displays the size of the file in bytes. The 6th column is the date and time the file was last modified. The 7th column is the file or directory name.
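The seven columns can be picked apart by splitting a sample line in Python (the line below is a made-up example, not real output from your instance):

```python
# A made-up "ls -l" output line, split into the columns described above.
line = "-rw-rw-r-- 1 ec2-user ec2-user 1234 Oct  1 12:00 myfile.txt"
fields = line.split(None, 8)  # the date/time spans three of the 9 tokens
perms, links, owner, group, size = fields[:5]
modified = " ".join(fields[5:8])
name = fields[8]
print(perms[0])  # '-' for a normal file, 'd' for a directory
print(size)      # file size in bytes: '1234'
print(name)      # 'myfile.txt'
```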

List all contents of the current directory in long list and human readable formats. “-h” will put large files in more readable units than bytes.
$ ls -lh

SUBMIT: The size of the grail file.

10. More on viewing files.

If you issue “cat grail”, the contents of grail will be printed. However, this file is too large to fit on the screen.

Show the grail file one page at a time.
$ more grail
Hit the spacebar to go one page down, type “b” to go one page up, and type “q” to quit.


$ less grail
Less has more options than more (“less is more and more is less”). You can now use the keyboard Arrows and Page Up/Down to scroll. You can type “h” for help, which will display additional options.

View the line numbers in the grail file. The cat command has the -n option, which prints line numbers, but you may also want to use more to view the file one page at a time. A solution is to pipe the output from cat to more. A pipe redirects the output from one program to another program for further processing, and it is represented with “|”.

$ cat -n grail | more

Redirect the standard output (stdout) of a command.
$ cat myfile.txt > redirect1.txt
$ ls -lh > redirect2.txt

Append the stdout to a file.
$ cat mycopy.txt >> myfile.txt
mycopy.txt will be appended to myfile.txt.

• Note: “cat mycopy.txt > myfile.txt” will overwrite myfile.txt with the contents output by “cat mycopy.txt”. Thus using >> is crucial if you want to preserve the existing file contents.

11. Change access permissions to objects with the change mode command.

The following represent roles:
u – user, g – group, o – others, a – all

The following represent permissions:
r – read, w – write, x – execute

Remove the read permission for your user on a file.
$ chmod u-r myfile.txt
Try to read this file. You should receive a “permission denied” message, because the user (owner) permissions are the ones that apply to you.

SUBMIT: The screenshot of the permission denied error

Give your user read permission on a file. Use the same file you removed the read permission from.
$ chmod u+r myfile.txt
You should now be able to read this file again.

12. Python examples

Install Python if it is not available on your machine.

$ sudo yum install python

Create a Python file. These instructions will use Emacs as a text editor, but you can still choose whichever text editor you prefer.

$ emacs
(Write a simple Python program)
print "*" * 25
print "My Lucky Numbers".rjust(20)
print "*" * 25

for i in range(10):
    lucky_nbr = (i + 1) * 2
    print "My lucky number is %s!" % lucky_nbr

Run your Python program.
$ python

Redirect your output to a file
$ python > lucky.txt

Pipe the stdout from your first program to another Python program that will replace “is” with “was”.
$ emacs

import sys

for line in sys.stdin:
    print line.replace("is", "was")

$ python | python

Write Python code to read a text file (you can use myfile.txt) and output, for each word, the number of times that word occurs in the entire file. For example, if the word “Hadoop” occurs in the file 5 times, your code should print “Hadoop 5”. It should output the count of every word occurring in the file.
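A minimal sketch of this kind of word count, shown over an in-memory string (it splits on whitespace; to run it on myfile.txt, pass the file's contents instead of the sample):

```python
from collections import Counter

def word_counts(text):
    # Tally how many times each whitespace-separated word occurs.
    return Counter(text.split())

sample = "Hadoop is fast and Hadoop is scalable"
for word, count in word_counts(sample).items():
    print(word, count)  # e.g. "Hadoop 2"
```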

SUBMIT: The screen output from running your python code and a copy of your python code. Homework submissions without code will receive no credit.

Part 3: Wordcount
For this part of the assignment, you will run wordcount on a single-node Hadoop instance. I am going to provide detailed instructions to help you get Hadoop running. The instructions follow Hadoop: The Definitive Guide, Appendix A: Installing Apache Hadoop.

You can download 2.6.4 from here. You can copy-paste these commands (right-click in PuTTy to paste, but please watch out for error messages and run commands one by one)
Install ant to list java processes
sudo yum install ant

(wget command stands for “web get” and lets you download files to your instance from a URL link)

(unpack the archive)
tar xzf hadoop-2.6.4.tar.gz

Modify the configuration file in hadoop-2.6.4/etc/hadoop/ to add the JAVA_HOME setting to it.
You can open it by running (using nano or your favorite editor instead of nano):
nano hadoop-2.6.4/etc/hadoop/
Note that # comments out a line, so you would comment out the original JAVA_HOME line and replace it with the new one as below.

NOTE: you will need to determine the correct Java configuration line by executing the following command
[ec2-user@ip-172-31-16-63 ~]$ readlink -f $(which java)
which will output something like:

In my case, Java home is at (remove the bin/java from the path):

modify the .bashrc file to add these two lines:
export HADOOP_HOME=~/hadoop-2.6.4

The .bashrc file contains environment settings that are applied automatically on each login. You can open the .bashrc file by running
nano ~/.bashrc

To immediately refresh the settings (that will be automatic on next login), run
source ~/.bashrc

Next, follow the instructions for Pseudodistributed Mode for all 4 files.

(to edit the first config file)
nano hadoop-2.6.4/etc/hadoop/core-site.xml

Make sure you paste the settings between the opening and closing <configuration> tags. NOTE: the contents of each of the 4 files are different. The contents of each file are described in Appendix A of the Hadoop book; the relevant appendix is also included with the homework assignment. I am also including a .txt file (HadoopConfigurationText) so that it is easier to copy-paste.

nano hadoop-2.6.4/etc/hadoop/hdfs-site.xml
(mapred-site.xml file is not there, run the following single line command to create it by copying from template. Then you can edit it as other files.)
cp hadoop-2.6.4/etc/hadoop/mapred-site.xml.template hadoop-2.6.4/etc/hadoop/mapred-site.xml
nano hadoop-2.6.4/etc/hadoop/mapred-site.xml
nano hadoop-2.6.4/etc/hadoop/yarn-site.xml

To enable passwordless ssh access (we will discuss SSH and public/private keys in class), run these commands:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

test by running (and confirming yes to a one-time warning)
ssh localhost

Format HDFS (i.e., first time initialize)

hdfs namenode -format

Start HDFS, YARN and the history server (answer a one-time yes if you are asked about host authenticity):
hadoop-2.6.4/sbin/start-dfs.sh
hadoop-2.6.4/sbin/start-yarn.sh
hadoop-2.6.4/sbin/mr-jobhistory-daemon.sh start historyserver

Verify that everything is running by listing the Java processes (jps):

(NameNode and DataNode are responsible for HDFS management; NodeManager and ResourceManager serve a function similar to that of JobTracker and TaskTracker.)

Create a destination directory
hadoop fs -mkdir /data
Download a large text file using

Copy the file to HDFS for processing

hadoop fs -put bioproject.xml /data/

(you can optionally verify that the file was uploaded to HDFS by running hadoop fs -ls /data)
Submit a screenshot of this command

Run word count on the downloaded text file, using the time command to determine the total runtime of the MapReduce job. You can use the following (single-line!) command. This invokes the wordcount example built into the example jar file, supplying /data/bioproject.xml as the input and /data/wordcount1 as the output directory. Please remember this is one command, if you do not paste it as a single line, it will not work.

time hadoop jar hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /data/bioproject.xml /data/wordcount1

Report the time that the job took to execute as screenshot
(this reports the size of a particular file or directory in HDFS. The output file will be named part-r-00000)
hadoop fs -du /data/wordcount1/

(Just like in Linux, the cat HDFS command will dump the contents of the entire file, and the grep command will filter the output to all lines that match a particular word). To determine the count of occurrences of “arctic”, run the following command:

hadoop fs -cat /data/wordcount1/part-r-00000 | grep arctic

It outputs the entire content of the part-r-00000 file and then uses the pipe (|) operator to filter it through the grep command. If you remove the pipe and grep, you will get the entire word count output dumped to the screen, similar to the cat command.

Congratulations, you just finished running wordcount using Hadoop.
Part 4: Hive Intro

4) In this section we are going to use Hive to run a few queries over the Hadoop framework. These instructions assume that you are starting from a working Hadoop installation. If you have just started your instance, you need to start Hadoop as well.
Hive commands are listed in Calibri bold font

a) Download and install Hive:
(this command is there to make sure you start from home directory, on the same level as where hadoop is located)
gunzip apache-hive-2.0.1-bin.tar.gz
tar xvf apache-hive-2.0.1-bin.tar

set the environment variables (can be automated by adding these lines in ~/.bashrc). If you don’t, you will have to set these variables every time you use Hive.
export HIVE_HOME=/home/ec2-user/apache-hive-2.0.1-bin
export PATH=$HIVE_HOME/bin:$PATH

$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
(if you get an error here, it means that /user/hive does not exist yet. Fix that by running $HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse instead)

$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

We are going to use Vehicle data (originally from

You can get the already unzipped, comma-separated file from here:

You can take a look at the data file by either
nano vehicles.csv or
more vehicles.csv (you can press space to scroll and q or Ctrl-C to break out)

Note that the first row in the data file is the list of column names. The commands that start Hive are followed by the table that you will create in Hive, loading the first 5 columns. Hive is not particularly sensitive to invalid or partial data, so if we only define the first 5 columns, it will simply load the first 5 columns and ignore the rest.
You can see the description of all the columns here (atvtype was added later)
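The way Hive keeps only the columns you define and drops the trailing ones can be pictured with plain Python's csv module (the rows below are invented, not taken from the real vehicles.csv):

```python
import csv
import io

# Invented rows shaped like vehicles.csv: a header of column names,
# then data rows that have more fields than the table defines.
data = ("barrels08,barrelsA08,charge120,charge240,city08,extra1,extra2\n"
        "15.6,0.0,0.0,0.0,18.0,x,y\n")
reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row of column names in this sketch
for row in reader:
    first_five = row[:5]  # a 5-column table keeps these; the rest is ignored
    print(first_five)
```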

Create the ec2-user directory on the HDFS side (absolute-path commands work from anywhere, not just from the Hadoop directory as bin/hadoop does). Here, we are creating the user “home” directory on the HDFS side.

hadoop fs -mkdir /user/ec2-user/

Run hive (from the hive directory because of the first command below):
$HIVE_HOME/bin/schematool -initSchema -dbType derby
(NOTE: This command initializes the database metastore. If you need to restart/reformat or see errors related to meta store, delete the metastore using rm -rf metastore_db/ and then repeat the above initSchema command)

You can now create a table by pasting this into the Hive terminal:

CREATE TABLE VehicleData (
barrels08 FLOAT, barrelsA08 FLOAT,
charge120 FLOAT, charge240 FLOAT,
city08 FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

You can load the data (from the local file system, not HDFS) using:

LOAD DATA LOCAL INPATH '/home/ec2-user/vehicles.csv'
INTO TABLE VehicleData;

(NOTE: If you downloaded vehicles.csv file into the hive directory, you have to change file name to /home/ec2-user/apache-hive-2.0.1-bin/vehicles.csv instead)

Verify that your table has loaded successfully by running a row count, e.g. SELECT COUNT(*) FROM VehicleData;
(Copy the query output and report how many rows you got as an answer.)

Run a couple of HiveQL queries to verify that everything is working properly:

SELECT MIN(barrels08), AVG(barrels08), MAX(barrels08) FROM VehicleData;
(copy the output from that query)

SELECT (barrels08/city08) FROM VehicleData;
(you do not need to report the output from that query, but report “Time taken”)

Next, we are going to output three of the columns into a separate file (as a way to transform data for further manipulation that you may be interested in)

INSERT OVERWRITE DIRECTORY 'ThreeColExtract'
SELECT barrels08, city08, charge120
FROM VehicleData;

You can now exit Hive by running exit;

And verify that the new output file has been created (the file will be called 000000_0)
The file would be created in HDFS in user home directory (/user/ec2-user/ThreeColExtract)

Report the size of the newly created file and include the screenshot.

Next, you should go back to the Hive terminal, create a new table that is going to load 8 columns instead of 5 in our example (i.e. create and load a new table that defines 8 columns by including columns city08U,cityA08,cityA08U) and use Hive to generate a new output file containing only the city08U and cityA08U columns from the vehicles.csv file. Report the size of that output file as well.
If the size is zero, you are looking at a directory and the output file(s) are in that directory.

Submit a single document containing your written answers. Be sure that this document contains your name and “CSC 555 Assignment 2” at the top.