CSC 555 / DSC 333 Mining Big Data Assignment 4 solution

1) Describe how to implement the following query in MapReduce terms (you do not have to code anything for this part):

select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part, supplier
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and lo_suppkey = s_suppkey
and p_category = 'MFGR#12'
and s_region = 'AMERICA'
group by d_year, p_brand1
order by d_year, p_brand1;
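
For orientation, one common way to express this query in MapReduce terms is a chain of reduce-side joins (lineorder with part, then with supplier, then with dwdate, applying the p_category and s_region filters as early as possible), followed by a final job that groups on (d_year, p_brand1) and sums lo_revenue; the order-by falls out of the sort in that last job. The sketch below is a minimal illustration of the first join's mapper only; the table tags and column positions are assumptions read off the SSBM schema, not a required answer format.

import sys

# Illustrative mapper for the first reduce-side join: lineorder JOIN part
# on lo_partkey = p_partkey, applying the p_category filter map-side.
# Column positions (0-based, pipe-delimited) are assumptions from the SSBM schema.
for line in sys.stdin:
    fields = line.strip().split('|')
    if len(fields) >= 17:                # lineorder rows are the wide ones
        # emit: join key, table tag, then the columns later stages still need
        # (lo_suppkey, lo_orderdate, lo_revenue)
        print('%s\tL\t%s\t%s\t%s' % (fields[3], fields[4], fields[5], fields[12]))
    elif fields[3] == 'MFGR#12':         # part rows: p_category filter
        print('%s\tP\t%s' % (fields[0], fields[4]))   # p_partkey, p_brand1

The matching reducer collects the records for each key and joins the single 'P' record against the 'L' records; the supplier and dwdate joins, the grouped sum, and the final sort each follow the same pattern in subsequent jobs.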

2) Consider a Hadoop job that will result in 93 blocks of output to HDFS.
Suppose that writing one output block to HDFS takes 1 minute. The HDFS replication factor is set to 3 unless otherwise noted. The cost of writing output should include the replicated blocks, even though the copies are written by HDFS rather than by the reducer itself; the reducer produces the computed result, which is then written to HDFS. (A sketch for checking the arithmetic follows part e.)

a) How long will it take for the reducer to write the job output on a 5-node Hadoop cluster? (Ignore the cost of Map processing, but count the replication cost in the output writing.)

b) How long will it take for reducer(s) to write the job output to 20 Hadoop worker nodes? (Assume that data is distributed evenly and replication factor is set to 1)

c) How long will it take for reducer(s) to write the job output to 20 Hadoop worker nodes? (Assume that data is distributed evenly and replication factor is set to 3)

d) How long will it take for reducer(s) to write the job output to 100 Hadoop worker nodes? (Assume that data is distributed evenly and replication factor is set to 1)

e) How long will it take for reducer(s) to write the job output to 100 Hadoop worker nodes? (Assume that data is distributed evenly and replication factor is set to 3)

You can ignore the network transfer costs as well as the possibility of node failure.
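
As a quick sanity check on the arithmetic, here is a minimal model; it assumes (these are the sketch's assumptions, not part of the assignment) that all worker nodes write in parallel, each node writes one block per minute, and every replica counts as one more block to write.

import math

# Total blocks to write = output blocks x replication factor; with the work
# spread evenly, the slowest node writes ceil(total / nodes) blocks.
def write_minutes(blocks, replication, nodes):
    total = blocks * replication
    return math.ceil(total / nodes)

for nodes, repl in [(5, 3), (20, 1), (20, 3), (100, 1), (100, 3)]:
    print('%d nodes, replication %d: %d minutes' % (nodes, repl, write_minutes(93, repl, nodes)))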

3) Given the following keys: 1, 4, 6, 8, 11, 12, 15, 16, 17, 25, 26, 29, 50, 52, 59, 88, 89, 95, 98, design the following:

a) A distribution of these keys across 3 reducers using the default key partitioner (% 3)

b) Design a custom sorting partitioner instead of the default one and describe the resulting output across the same 3 reducers (a sketch of both partitioners follows part c)

c) What is the downside (i.e., extra overhead) of employing a custom partitioner?
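
A minimal sketch contrasting the two schemes (the range boundaries below are our own choice; any cut points that give each reducer a disjoint, ordered key range would serve):

# Compare the default modulo partitioner with a range ("sorting") partitioner
# for the given keys across 3 reducers.
keys = [1, 4, 6, 8, 11, 12, 15, 16, 17, 25, 26, 29, 50, 52, 59, 88, 89, 95, 98]

# (a) default partitioner: reducer = key % 3
default_parts = {r: [k for k in keys if k % 3 == r] for r in range(3)}

# (b) range partitioner: assumed boundaries 16 and 50 split the keys roughly
# evenly, so concatenating reducer outputs 0, 1, 2 yields a globally sorted list
def range_partition(key):
    if key < 16:
        return 0
    return 1 if key < 50 else 2

range_parts = {r: [k for k in keys if range_partition(k) == r] for r in range(3)}
print(default_parts)
print(range_parts)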

4) Implement the following query using Hadoop streaming and Python, with the lineorder table:
http://cdmgcsarprd01.dpu.depaul.edu/CSC555/SSBM1/SSBM_schema_hive.sql
http://cdmgcsarprd01.dpu.depaul.edu/CSC555/SSBM1/lineorder.tbl

SELECT lo_shipmode, STDDEV(lo_tax)
FROM lineorder
WHERE lo_quantity BETWEEN 16 AND 28
GROUP BY lo_shipmode;

STDDEV is standard deviation. Don't forget to submit your Python code and the command lines you used to execute Hadoop streaming. I also recommend submitting a screenshot of the execution to simplify the grader's job.
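
A minimal sketch of one way to do this (the file names mapper.py and reducer.py, the output path, and the streaming jar location are our own assumptions; the column positions are read off the SSBM lineorder schema linked above, with lo_quantity, lo_tax, and lo_shipmode at 0-based, pipe-delimited positions 8, 14, and 16):

(mapper.py)
#!/usr/bin/env python
# Emit "lo_shipmode <tab> lo_tax" for rows with 16 <= lo_quantity <= 28.
import sys

for line in sys.stdin:
    fields = line.strip().split('|')
    if len(fields) < 17:
        continue                              # skip malformed rows
    if 16 <= int(fields[8]) <= 28:            # lo_quantity BETWEEN 16 AND 28
        print('%s\t%s' % (fields[16], fields[14]))   # lo_shipmode, lo_tax

(reducer.py)
#!/usr/bin/env python
# Keys arrive grouped and sorted; keep count, sum, and sum of squares per
# lo_shipmode and emit the population standard deviation.
import sys
import math

def emit(key, n, s, ss):
    if key is not None and n > 0:
        var = max(ss / n - (s / n) ** 2, 0.0)  # clamp tiny negative round-off
        print('%s\t%f' % (key, math.sqrt(var)))

current, n, s, ss = None, 0, 0.0, 0.0
for line in sys.stdin:
    key, value = line.strip().split('\t')
    if key != current:
        emit(current, n, s, ss)
        current, n, s, ss = key, 0, 0.0, 0.0
    x = float(value)
    n += 1
    s += x
    ss += x * x
emit(current, n, s, ss)

A typical invocation (the jar path varies by Hadoop version and install):
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input lineorder.tbl -output q4out \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py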

5) (We will discuss compression on Tuesday, 5/3.) Given the data QQQQZZZZZAAAANNN, assume that each letter takes 1 byte (8 bits). The goal of this exercise is to quantify compression effects; the math itself is trivial.

a) What is the storage size of the uncompressed string?

b) Suppose you apply Run-Length Encoding (RLE) compression (e.g., replace QQQQ with Q4). What is the size of the RLE-compressed string? You can assume that a count digit such as the 4 in Q4 also requires 1 byte, just as Q does.

c) Suppose you build a dictionary that represents each letter with a 5-bit code. What is the size of the dictionary-compressed string (where each letter is replaced by a 5-bit code)?

d) Repeat the computations in 5-b and 5-c using the string QQBQQZZUZZAANN (which has only 14 characters and thus takes 14 bytes uncompressed).
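
A small helper for checking the arithmetic in parts a through d (a sketch; it assumes 1 byte per letter and per run-length count, and a fixed 5-bit dictionary code per letter):

# Compute raw, RLE, and fixed-width dictionary sizes for a string.
def rle_runs(s):
    runs, i = 0, 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                    # advance past the current run
        runs += 1
        i = j
    return runs

for s in ('QQQQZZZZZAAAANNN', 'QQBQQZZUZZAANN'):
    raw_bytes = len(s)                # 1 byte per letter
    rle_bytes = rle_runs(s) * 2       # 1 letter byte + 1 count byte per run
    dict_bits = len(s) * 5            # 5-bit code per letter
    print(s, raw_bytes, rle_bytes, dict_bits)

Note that RLE can expand data when runs are short, as part d is designed to show.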

6) In this section you will practice using HBase. Note that HBase runs on top of HDFS, bypassing the MapReduce engine.
cd
(Download HBase)
wget http://dbgroup.cdm.depaul.edu/Courses/CSC555/hbase-0.90.3.tar.gz
gunzip hbase-0.90.3.tar.gz
tar xvf hbase-0.90.3.tar
cd hbase-0.90.3

(Start the HBase service; there is a corresponding stop script, and this assumes the Hadoop home is set)
bin/start-hbase.sh
(Open the HBase shell – at this point jps should show HMaster)
bin/hbase shell
(Create an employees table with two column families, private and public. Watch the quotes: if a straight quote ' turns into a curly quote, the commands will not work)
create 'employees', {NAME => 'private'}, {NAME => 'public'}
put 'employees', 'ID1', 'private:ssn', '111-222-334'
put 'employees', 'ID2', 'private:ssn', '222-338-446'
put 'employees', 'ID3', 'private:address', '123 State St.'
put 'employees', 'ID1', 'private:address', '243 N. Wabash Av.'
scan 'employees'

Now that we have filled in a couple of values, add 3 new columns to the private family, 1 new column to the public family, and create a brand new family with at least 2 columns. For each of these, you should introduce at least 2 values, for a total of (3 + 1 + 2) * 2 = 12 inserted values. Verify that the table has been filled in properly with the scan command and submit a screenshot.
NOTE: In order to add a new column family to an HBase table called test, you would need to run the following commands:
disable 'test'
alter 'test', 'myNewFavoriteColumnFamily'
enable 'test'
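For example, applying the same sequence to the employees table above (the contact family and its columns are made-up illustrations, not required names):
disable 'employees'
alter 'employees', 'contact'
enable 'employees'
put 'employees', 'ID1', 'contact:email', 'id1@example.com'
put 'employees', 'ID2', 'contact:phone', '312-555-0100'
scan 'employees'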

Submit a single document containing your written answers. Be sure that this document contains your name and “CSC 555 Assignment 4” at the top.