Problem statement:

In this project, your task is to perform data analytics over a dataset of online social
networks using MRJob.

Input files:

The dataset contains users’ check-in history, in which each record has the format
“userID, locID, check_in_time”, where userID (string type) is the ID of a user, locID
(string type) is the ID of a location, and check_in_time is the timestamp of the user’s
check-in at this location. A sample file is shown below:

This small sample file can be downloaded at:

Problem description:

We denote the number of check-ins at location $loc_i$ by user $u_j$ as
$n_{loc_i}^{u_j}$, and the number of check-ins from $u_j$ as $n_{u_j}$. Thus,
$n_{u_j} = \sum_{loc_i \in L_{u_j}} n_{loc_i}^{u_j}$, where $L_{u_j}$ is the set of
locations visited by $u_j$.

The probability that $u_j$ checked in at $loc_i$ is computed as
$prob_{loc_i}^{u_j} = \frac{n_{loc_i}^{u_j}}{n_{u_j}}$. Your task is to compute
$prob_{loc_i}^{u_j}$ for each user at each location which has been visited by this user.

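As a sanity check on the definition above, the probabilities can be computed in plain
Python on a handful of records. This is only an in-memory sketch, not an MRJob solution:
the records and the function name check_in_probabilities are made up for illustration,
and the sketch assumes that userID and locID contain no commas.

```python
from collections import Counter

# Hypothetical records in the "userID, locID, check_in_time" format.
# The timestamps t1..t8 are placeholders, not real data.
records = [
    "u1, l1, t1",
    "u1, l1, t2",
    "u1, l2, t3",
    "u2, l1, t4",
    "u2, l3, t5",
    "u3, l2, t6",
    "u3, l2, t7",
    "u3, l3, t8",
]


def check_in_probabilities(lines):
    """Return {(user, loc): n_loc_user / n_user} for every visited pair."""
    pair_counts = Counter()   # check-ins of each user at each location
    user_totals = Counter()   # total check-ins of each user
    for line in lines:
        # Split at most twice so a timestamp containing commas stays intact.
        user, loc, _time = (field.strip() for field in line.split(",", 2))
        pair_counts[(user, loc)] += 1
        user_totals[user] += 1
    return {
        (user, loc): count / user_totals[user]
        for (user, loc), count in pair_counts.items()
    }


probs = check_in_probabilities(records)
```

Here u1 has three check-ins, two of them at l1, so probs[("u1", "l1")] is 2/3. For any
single user, the probabilities over all visited locations sum to 1.
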
Output format:
Store the result in HDFS in the format “$loc_i$, $prob_{loc_i}^{u_j}$”, with the user
ID accompanying each probability as in the example below. The results are first sorted
by location ID in ascending order, and then by the users’ check-in probabilities in
descending order. If two users have the same probability, sort them by their IDs in
ascending order.
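
The three-level ordering can be expressed as a single sort key. The sketch below is
plain Python on made-up rows, not an MRJob job. The name output_order is hypothetical,
and the sketch assumes that IDs are compared as strings (so “l10” would sort before
“l2”).

```python
# Hypothetical (locID, userID, probability) rows, for illustration only.
rows = [
    ("l2", "u1", 0.3333),
    ("l1", "u4", 0.5),
    ("l1", "u2", 0.5),
    ("l1", "u1", 0.6667),
    ("l2", "u3", 0.6667),
]


def output_order(row):
    """Location ascending, probability descending, user ID ascending."""
    loc, user, prob = row
    # Negating the probability turns an ascending sort into a descending one.
    return (loc, -prob, user)


ordered = sorted(rows, key=output_order)
for loc, user, prob in ordered:
    print(loc, user, prob)
```

Within l1, u1 comes first because it has the highest probability, and the tie between
u2 and u4 is broken by user ID.
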

For example, given the above data set, the output is (there is no need to remove the
quotation marks which are generated by MRJob):
"l1" "u1,0.6667"
"l1" "u2,0.5"
"l2" "u3,0.6667"
"l2" "u1,0.3333"
"l3" "u2,0.5"
"l3" "u3,0.3333"

The entire output could be checked at:
Code format:
Please name your Python file as “” and compress it in a package named
“” (e.g.
Command of running your code:

We will use more than one reducer to test your code. Assuming we use 2 reducers,
we will use the following command to run your code:
$ python3 -r hadoop hdfs_input -o hdfs_output --jobconf
In this command, hdfs_input is the input path on HDFS, and hdfs_output is the
output folder on HDFS.

The code template can be downloaded from:
Warning: Please ensure that the code you submit can be compiled. Any solution
that has compilation errors will receive no more than 4 points.

Marking Criteria:

Your source code will be inspected and marked based on readability and ease of
understanding. The documentation (comments of the codes) in your source code is
also important. Below is an indicative marking scheme:
Result correctness: 6
Algorithm design (the use of design patterns learned to
reduce memory consumption and to improve efficiency): 5
Code structure, Readability, and Documentation: 1


Deadline: Sunday 9th Oct 11:59:59 PM
You can submit through Moodle:
If you submit your assignment more than once, the last submission will replace the
previous one. To prove successful submission, please take a screenshot as shown in the
assignment submission instructions and keep it for your records. If you have any
problems with your submission, please email to
Late submission penalty
5% reduction of your marks for up to 5 days


The work you submit must be your own work. Submission of work partially or
completely derived from any other person or jointly written with any other person is
not permitted.

The penalties for such an offence may include negative marks,
automatic failure of the course and possibly other academic discipline. Assignment
submissions will be examined manually.

Relevant scholarship authorities will be informed if students holding scholarships are
involved in an incident of plagiarism or other misconduct.
Do not provide or show your assignment work to any other person, apart from the
teaching staff of this subject.

If you knowingly provide or show your assignment
work to another person for any reason, and work derived from it is submitted you
may be penalized, even if the work was submitted without your knowledge or