Sale!

DSCI 551 HW5 (Hadoop MapReduce & Spark) solution

Original price was: $35.00.Current price is: $30.00. $25.50

Category:

Description

5/5 - (3 votes)

In this homework, we will consider the churn data set again (as in hw1). You are given two versions of
the file: churn4hadoop.csv and churn.csv. The former has not header, to be used for Hadoop quesQon
below; the laSer has header used in Spark.
1. [Hadoop MapReduce, 40 points] Complete the provided Churn.java by supplying the missing code as
indicated in the source file, so that it answers the following SQL query.
Select InternetService, max(tenure)
From Churn
Where churn = “Yes”
Group by InternetService
Having count(*) > 200;
ExecuQon format: hadoop jar churn.jar Churn input output
Where the input directory contains a single file: churn4hadoop.csv.
2. [40 points] For each of the following SQL queries, write a Spark script that finds the answer to the
query. Note to read a csv file with header into Spark as a dataframe, proceed as follows:
churn = spark.read.csv(‘churn.csv’, header=True)
You will also need to import this:
import pyspark.sql.funcQons as fc
a) select count(*)
from churn
where gender = ‘Male’ and churn = ‘Yes’;
b) select gender, max(TotalCharges)
from churn
where churn = “Yes”
group by gender;
Note: you will need to change the data type of TotalCharges from string to double. For example,
churn = churn.withColumn(‘TotalCharges’, fc.col(‘TotalCharges’).cast(‘double’))
c) select gender, count(*)
from churn
where churn = ‘Yes’
group by gender;
d) select churn, contract, count(*) cnt
from churn
group by churn, contract
order by churn, cnt desc;
(churn is ascending)
e) select gender, churn, count(*)
from churn
group by gender, churn
having count(*) > 1000;
3. [20 points] Write a Spark RDD script for each of the following SQL queries.
a. Same as q2.a.
b. Same as q2.b.
Submission:
• Q1: Churn.java and churn.jar and part-r-00000 under the output directory.
• Q2: submit a text file q2-soluQon.txt with your scripts and outputs from each script.
• Q3: submit a text file q3-soluQon.txt with your scripts and outputs from each script.