Name: CS 6240: Assignment 3 solution
SKU: 26135
Price: 30.00 USD
Availability: InStock

Description

5/5 - (7 votes)

Goal: Implement PageRank in MapReduce to explore the behavior of an iterative graph algorithm.
This homework is to be completed individually (i.e., no teams). You have to create all deliverables
yourself from scratch. In particular, it is not allowed to copy someone else’s code or text and modify it.
(If you use publicly available code/text, you need to cite the source in your code and report!)
Please submit your solution through Blackboard by the due date shown online. For late submissions you
will lose one percentage point per hour after the deadline. This HW is worth 100 points and accounts for
15% of your overall homework score. To encourage early work, you will receive a 10-point bonus if you
submit your solution on or before the early submission deadline stated on Blackboard. (Notice that your
total score cannot exceed 100 points, but the extra points would compensate for any deductions.)
Always package all your solution files, including the report, into a single standard ZIP file. Make sure
your report is a PDF file.
To enable the graders to run your solution, make sure your project includes a standard Makefile with
the same top-level targets (e.g., alone and cloud) as the one Joe presented in class (see the Extra
Material folder in the Syllabus and Course Resources section). You may simply copy Joe’s Makefile and
modify the variable settings in the beginning as necessary. For this Makefile to work on your machine,
you need Maven and make sure that the Maven plugins and dependencies in the pom.xml file are
correct. Notice that in order to use the Makefile to execute your job elegantly on the cloud as shown by
Joe, you also need to set up the AWS CLI on your machine. (If you are familiar with Gradle, you may also
use it instead. However, we do not provide examples for Gradle.)
As with all software projects, you must include a README file briefly describing all of the steps necessary
to build and execute both the standalone and AWS Elastic MapReduce (EMR) versions of your program.
This description should include the build commands, and fully describe the execution steps. This
README will also be graded.
PageRank in MapReduce
For this assignment, we are going to apply PageRank to Wikipedia articles in order to find the most
referenced ones. In practice this requires several non-trivial pre-processing steps. For example, the
Wikipedia dumps might be compressed in a format not (well-) supported by Hadoop. In principle it is
easy to find hyperlinks in HTML, doing so requires some experience with document parsers. As some
links are not interesting, an understanding of the domain document schema is necessary but takes a
significant amount of time. Dealing with such obstacles is part of the real data analysis experience, but
since we want you to focus on parallel data processing algorithms, we decided to help you get started
more quickly.
We transformed the original Wikipedia 2006 data dump into Hadoop-friendly bz2-compressed files and
made them available at:
https://drive.google.com/drive/folders/1IIySfwwyvup2cy2bP4BfFaTYoFUSbWlK?usp=sharing
Use the simple 2006 English dataset for local development: wikipedia-simple-html.bz2
Use the four files comprising the full 2006 English dataset for final evaluation on EMR:
wikipedia–html.?.bz2
.bqz2 File Format
The bz2 compression format works well for parallel computation, because it reduces size while
remaining “splittable”, meaning that it can be processed in chunks in parallel like uncompressed data.
Other compression formats such as zip require centralized decompression. Each line of the bz2 file is
formatted to contain the name of the Wikipedia page, a colon (:), and then the full contents of the page
on a single line. (Note that there may be a header and/or footer—not shown below—bracketing the
sequence of pages.)
:
:
…
Wikipedia HTML Format
You need to convert the Wikipedia data into an adjacency-list based graph representation, using the
original page names as the node IDs. The distracting nuisance components (e.g., file path prefix and
.html suffix) must be removed. The Wikipedia files are in XHTML so they may be parsed by either an
HTML or XML parser. An example parser will be made available together with this assignment to help
you get started. Feel free to use and modify it. It is important to only keep hyperlinks from within the
bodyContent div tag of the document and ignore all others.
For example: