## Description

## Question 1. Concepts (4 marks)

(a) (2 marks) In what kind of problems a combiner class and a reducer class can be used

interchangeably? Please use an example to explain your answer.

(b) (2 marks) In one project, a student complained that her approach took a lot of time at the step when

using the reduce() function, but all the previous operations including reading the data by textFile(),

filtering the data by filter(), and transform the data by map() and flatmap(). Could you please explain

the reason to her?

## Question 2. MapReduce Programming (14 marks)

Requirement: You should explain how the input is mapped into (key, value) pairs by the map stage, i.e.,

specify what is the key and what is the associated value in each pair, and how the key(s) and value(s) are

computed.

Then you should explain how the (key, value) pairs produced by the map stage are processed

by the reduce stage to get the final answer(s). You only need to provide the pseudo code for the classes

including Mapper and Reducer (optionally Combiner etc.if necessary, and the efficiency of your method

will be considered).

(a) (4 marks) Given a table shown as below, find out the person(s) with the maximum salary in each

department (employees could have the same salary).

EmployeeID Name DepartmentID Salary

001 Emma 1 100,000

002 Helen 2 85,000

003 Jack 3 85,000

004 James 1 110,000

2

(b) (10 marks) Problem Background: Given an undirected graph G, its “line graph” is another graph L(G)

that represents the adjacencies between edges of G, such that:

• each vertex of L(G) represents an edge of G; and

• two vertices of L(G) are adjacent if and only if their corresponding edges share a common

endpoint (“are incident”) in G.

The following figures show a graph (left) and its line graph (right). Each vertex of the line graph is

shown labelled with the pair of endpoints of the corresponding edge in the original graph. For instance,

the vertex on the right labelled (1,3) corresponds to the edge on the left between the vertices 1 and 3.

Vertex (1,3) is adjacent to three other vertices: (1,2) and (1,4) (corresponding to edges sharing the

endpoint 1 in G) and (3,4) (corresponding to an edge sharing the endpoint 3 in G). Note that the vertex

(4, 3) in the below example should be (3, 4) in the output.

Problem: Given you the adjacency list of an undirected graph G, use MapReduce to generate the

adjacency list of its line graph L(G). Note that each edge connecting two nodes i and j is represented by (i,

j) in L(G) (if i<j). In the output, the edges in each list should be ranked in ascending order by comparing

the first node and then the second node.

The adjacency lists should be ranked by the keys according to

the same order as well. Take the above figure as an example, sample input and output are as below:

Input: Output:

1: 2, 3, 4

2: 1, 5

3: 1, 4

4: 1, 3, 5

5: 2, 4

(1, 2): (1, 3), (1, 4), (2, 5)

(1, 3): (1, 2), (1, 4), (3, 4)

(1, 4): (1, 2), (1, 3), (3, 4), (4, 5)

(2, 5): (1, 2), (4, 5)

(3, 4): (1, 3), (1, 4), (4, 5)

(4, 5): (1, 4), (2, 5), (3, 4)

3

## Question 3. Spark Programming (14 marks)

Provide the PySpark code for the given problems (minor errors are acceptable).

(a) (7 marks) RDD programming: Given a set of marks from different courses (the input format is as

shown in the left column), the task is to: For each student, get his/her ranking in different courses. The

output format is <student_id: course_name, rank>.

Sort the output by student_id first and then by

course_name (the format is as shown in the right column).

Input: Output:

student1 course1 90

student1 course2 92

student1 course3 80

student1 course4 79

student1 course5 93

student2 course1 92

student2 course2 77

student2 course5 85

student3 course3 64

student3 course4 97

student3 course5 82

student1: course1,2

student1: course2,1

student1: course3,2

student1: course4,2

student1: course5,1

student2: course1,1

student2: course2,2

student2: course5,2

student3: course3,2

student3: course4,1

student3: course5,3

(b) (7 marks) DataFrame programming (RDD APIs not allowed): Given the same input (but different

format!) as in problem (a), compute average marks for every course and sort the result by course_name

in alphabetical order.

Input: Output:

student1:course1,90;course2,92;course3,80;course4,79;course5,93

student2:course1,92;course2,77;course5,85

student3:course3,64;course4,97;course5,82

course1:91

course2:84.5

course3:72

course4:88

course5:86.67

4

## Question 4. Finding Similar Items (6 marks)

(a) (2 marks) Given two documents A = (“the sky is dark the moon is bright”) and B = (“the moon in the

sky is bright”), using the words as tokens, compute the 2-shingles for A and B, and then compute their

Jaccard similarity based on their 2-shingles.

(b) (3 marks) We want to compute min-hash signature for two columns, C1 and C2 using two pseudorandom permutations of columns using the following function:

h1(n) = (5n + 2) mod 7

h2(n) = (3n + 1) mod 7

Here, n is the row number in original ordering. Instead of explicitly reordering the columns for each hash

function, we use the implementation discussed in class, in which we read each data in a column once in

a sequential order, and update the min hash signatures as we pass through them.

Complete the steps of the algorithm and give the resulting signatures for C1 and C2.

(c) (1 marks) Suppose we wish to find similar sets, and we do so by minhashing the sets 10 times and

then applying locality-sensitive hashing using 5 bands of 2 rows (minhash values) each. If two sets had

Jaccard similarity 0.6, what is the probability that they will be identified in the locality-sensitive hashing

as candidates (i.e. they hash at least once to the same bucket)?

You may assume that there are no

coincidences, where two unequal values hash to the same bucket. A correct expression is sufficient: you

need not give the actual number.

5

## Question 5. Mining Data Streams (6 marks)

(a) (3 marks) Counting Bloom Filter

Consider a Counting Bloom filter of size m = 7 and 2 hash functions that both take a string (lowercase) as

input:

h1(str) = ∑𝑐 𝑖𝑛 𝑠𝑡𝑟(𝑐 − ′𝑎′) mod 7

h2(str) = (str.length * 2) mod 7

Here, c – ‘a’ is used to compute the position of the letter c in the 26 alphabetical letters, e.g., h1(“bd”) =

(1 + 3) mod 7 = 4.

(i) (2 marks) Given a set of string S = {“hi”, “big”, “data”, “spark”}, show the update of the

Bloom filter

(ii) (1 mark) Delete “hi” from S, and then use the bloom filter to check if “sql” is contained in S.

(b) (3 marks) CM-Sketch

Assume that we have 5 buckets and three hash functions:

h0(str) = (str.length * 2) mod 5

h1(str) = (str.length + 3) mod 5

h2(str) = (str[0]-‘a’) mod 5

Given you a stream of terms: “big”, “data”, “data”, “hadoop”, “data”, “spark”, show the steps of building

the CM-Sketch. Then, use the built CM-sketch to get the count for word “data”.

6

### Question 6. Link Analysis (6 marks)

Given a directed graph G with the set of nodes {1,2,3,4,5,6} and the edges arranged as below:

Using the MapReduce PageRank algorithm (lecture slides 9.50 and 9.51), show the computation process

in the first two rounds (including the mapper input, mapper output, reducer input, and reducer output).

End Of Paper