
CS 6240: Assignment 1 solution


Goals: Gain hands-on experience with parallel computation and synchronization primitives in a single-machine shared-memory environment. Set up Hadoop on your development machine, become familiar
with AWS, and start working with MapReduce.
This homework is to be completed individually (i.e., no teams). You have to create all deliverables
yourself from scratch. In particular, you are not allowed to look at or copy someone else’s code/text and
modify it. (If you use publicly available code/text, you need to cite the source in your report!)
Please submit your solution through Blackboard by the due date shown online. For late submissions you
will lose one percentage point per hour after the deadline. This HW is worth 100 points and accounts for
15% of your overall homework score. To encourage early work, you will receive a 10-point bonus if you
submit your solution on or before the early submission deadline stated on Blackboard. (Notice that your
total score cannot exceed 100 points, but the extra points would compensate for any deductions.)
Always package all your solution files, including the report, into a single standard ZIP file. Make sure
your report is a PDF file.
For each program submission, include complete source code, build scripts, and small output files. Do not
include input data, output data over 1 MB, or any sort of binaries such as JAR or class files.
The following is optional for this assignment, but will be required starting with the next: To enable the
graders to run your solution, make sure you include a standard Makefile with the same top-level targets
(e.g., alone and cloud) as the one Joe presented in class (see the Extra Material folder in the Syllabus and
Course Resources section). You may simply copy Joe’s Makefile and modify the variable settings in the
beginning as necessary. For this Makefile to work on your machine, you need Maven, and you need to
make sure that the Maven plugins and dependencies in the pom.xml file are correct. Notice that in order
to use the Makefile to execute your job elegantly on the cloud as shown by Joe, you also need to set up
the AWS CLI on your machine.
We strongly recommend using a Makefile already for this assignment’s MapReduce part. Try to make it
work as shown by Joe in class. This will save you time in the long run and prepare you for the next
assignments; and using such build tools is an important skill in industry. If you are familiar with Gradle,
you may also use it instead. However, we do not provide examples for Gradle.
As with all software projects, you must include a README file briefly describing all of the steps necessary
to build and execute both the standalone and AWS Elastic MapReduce (EMR) versions of your program.
This description should start with unzipping the original submission, include the build commands, and
fully describe the execution steps. This README will also be graded and you will be able to reuse it on all
of this semester’s assignments with little modification (assuming you keep your project layout the
same).
You have about 2 weeks to work on this assignment. Section headers include recommended timings,
e.g., “complete in week 1”, to help you schedule your work. Of course, the earlier you work on this, the
better.
Optional But Highly Recommended: Set Up Github (Week 1)
Starting with HW 2, you will have to use the CCIS Github when developing your code. It is optional for
this HW, but we strongly recommend setting things up now. It shouldn’t take long, and this way we
can also give you early feedback on your use of Github.
Find the CCIS Github (not the public Github!) server and create a project for the multi-threaded Java
program discussed in the next section. Use an IDE like Eclipse with the corresponding Github plugin to
pull and push your code updates. Make sure you do the following:
  • Set all your projects for this course so that they are private in general, but accessible to the TAs
    and instructor. We will post our CCIS logins soon.
  • Make sure you commit and push changes regularly. As a rule of thumb, the “delta” between
    consecutive snapshots of your source code should be equivalent to about 2 hours’ worth of
    coding. We do not want to see you committing large, complete chunks of code that look like you
    just copied from someone else. We will provide feedback when grading your HW 1 solution.
Analyze Data with Sequential and Concurrent Java Programs
(Complete in Week 1)
For this assignment we will be working with climate data from NOAA. You can get the data here:
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
You may develop your program using a smaller input file, but final performance tests must use the
1912.csv.gz file as input. The goal is to compute the average TMAX temperature for each station ID,
using a single multi-core machine (e.g., your personal laptop).
Program Requirements
A. To eliminate the effect of I/O, create a 1-to-1 in-memory copy of the input file as follows: Write
a loader routine that takes an input filename, reads the file, and returns a String[] or List
containing the lines of the file. Do not change the content or order of the lines in any way. This
array or list is the starting point for all of the versions of the program described in B and C.
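The loader described in A can be sketched as follows (the class and method names are our own placeholders, not prescribed by the assignment; Files.readAllLines is one convenient option):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class Loader {
    // Reads the entire input file into memory, preserving line content and order.
    public static List<String> loadLines(String fileName) throws IOException {
        return Files.readAllLines(Paths.get(fileName));
    }
}
```

The returned list is then the single in-memory input shared by all program versions in B and C.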
B. Write five versions of the program. All of them will parse the lines of the file and then calculate
the average TMAX temperature by station. A record will usually have format (StationId, Date,
Type, Reading,…), but, as in any real data set, there could be errors or missing values. (We
recommend you try to find documentation for the data and consult it.) Note that the
computation should ignore all records (e.g., TMIN, PRCP) that are not TMAX records.
1. SEQ: Sequential version that calculates the average of the TMAX temperatures by
station Id. Pick a data structure that makes sense for grouping the accumulated
temperatures and count of records by station. We will refer to this as the accumulation
data structure.
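One possible accumulation data structure, sketched below, is a HashMap from station ID to a two-element array holding the running sum and count (the class name, the CSV field positions, and the parsing details are our assumptions; check the NOAA documentation for the actual record format):

```java
import java.util.HashMap;
import java.util.Map;

public class SeqAverager {
    // Accumulation data structure: station ID -> {sum of TMAX readings, record count}.
    private final Map<String, long[]> acc = new HashMap<>();

    // Parses one record and accumulates it only if it is a TMAX reading.
    public void accumulate(String line) {
        String[] f = line.split(",");
        if (f.length < 4 || !"TMAX".equals(f[2])) return;  // skip TMIN, PRCP, malformed lines
        long reading;
        try {
            reading = Long.parseLong(f[3]);
        } catch (NumberFormatException e) {
            return;  // ignore records with missing or bad values
        }
        long[] sumCount = acc.computeIfAbsent(f[0], k -> new long[2]);
        sumCount[0] += reading;
        sumCount[1]++;
    }

    public double average(String stationId) {
        long[] sc = acc.get(stationId);
        return (double) sc[0] / sc[1];
    }
}
```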
For the threaded versions, spawn about as many worker threads as your machine has
physical cores and assign about the same amount of work to each. You want to make sure
that there are at least 2 concurrent worker threads. If your machine has k>2 cores, we
recommend using k-1 worker threads. (This way the main process thread and OS
background processes have another core available.)
2. NO-LOCK: Multi-threaded version that assigns subsets of the input String[] (or
List) for concurrent processing by separate threads. This version should use a
single shared accumulation data structure and should use no locks or synchronization
on it, i.e., it completely ignores any possible data inconsistency due to parallel
execution.
3. COARSE-LOCK: Multi-threaded version that assigns subsets of the input String[] (or
List) for processing by separate threads. This version should also use a single
shared accumulation data structure and can only use the single lock on the entire data
structure. Design your program to ensure (1) correct multithreaded execution and (2)
minimal delays by holding the lock only when absolutely necessary.
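A sketch of one way to hold the single coarse lock only for the update itself, parsing each record outside the critical section (field positions and names are illustrative, matching no particular parser you may already have):

```java
import java.util.List;
import java.util.Map;

public class CoarseLockWorker implements Runnable {
    private final List<String> slice;          // this thread's subset of the input lines
    private final Map<String, long[]> shared;  // the single shared accumulation structure

    public CoarseLockWorker(List<String> slice, Map<String, long[]> shared) {
        this.slice = slice;
        this.shared = shared;
    }

    @Override
    public void run() {
        for (String line : slice) {
            String[] f = line.split(",");      // parse outside the lock
            if (f.length < 4 || !"TMAX".equals(f[2])) continue;
            long reading = Long.parseLong(f[3]);
            synchronized (shared) {            // lock the whole structure, but only briefly
                long[] sc = shared.computeIfAbsent(f[0], k -> new long[2]);
                sc[0] += reading;
                sc[1]++;
            }
        }
    }
}
```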
4. FINE-LOCK: Multi-threaded version that assigns subsets of the input String[] (or
List) for processing by separate threads. This version should also use a single
shared accumulation data structure, but should, whenever possible, lock only the
accumulation value objects and not the whole data structure. Design your program to
ensure (1) correct multithreaded execution and (2) minimal delays by holding the locks
only when absolutely necessary. You may try to accomplish this using a data structure
which will avoid data races.
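One sketch of the fine-grained idea, assuming a ConcurrentHashMap so that inserting a new station entry is itself race-free, with locking confined to the per-station value object (this is one possible design, not the only acceptable one):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class FineLockAccumulator {
    // ConcurrentHashMap makes inserting a new station thread-safe;
    // the per-station long[] value object is what we lock during an update.
    private final ConcurrentMap<String, long[]> acc = new ConcurrentHashMap<>();

    public void add(String stationId, long reading) {
        long[] sc = acc.computeIfAbsent(stationId, k -> new long[2]);
        synchronized (sc) {   // lock only this station's entry, not the whole map
            sc[0] += reading;
            sc[1]++;
        }
    }

    public double average(String stationId) {
        long[] sc = acc.get(stationId);
        synchronized (sc) {
            return (double) sc[0] / sc[1];
        }
    }
}
```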
5. NO-SHARING: Per-thread data structure multi-threaded version that assigns subsets of
the input String[] (or List) for processing by separate threads. Each thread
should work on its own separate instance of the accumulation data structure. Hence no
locks are needed. However, you need a barrier to determine when the separate threads
have terminated and then reduce the separate data structures into a single one using
the main thread.
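Thread.join() can serve as the barrier here; afterwards the main thread reduces the per-thread maps, for example along these lines (the method name is a placeholder):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NoSharingMerge {
    // Reduces the per-thread accumulation maps into a single result map.
    // Called by the main thread after all worker threads have been joined.
    public static Map<String, long[]> merge(List<Map<String, long[]>> perThread) {
        Map<String, long[]> result = new HashMap<>();
        for (Map<String, long[]> m : perThread) {
            for (Map.Entry<String, long[]> e : m.entrySet()) {
                long[] sc = result.computeIfAbsent(e.getKey(), k -> new long[2]);
                sc[0] += e.getValue()[0];  // add this thread's partial sum
                sc[1] += e.getValue()[1];  // add this thread's partial count
            }
        }
        return result;
    }
}
```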
For each of the above versions, use System.currentTimeMillis() within your code to time its
execution. Note that you should not time the file loading routine (i.e., step A above), nor should
you time the printing of results to stdout. Carefully use barriers for multi-threaded code to time
only the record parsing and per station average calculations. Time the execution of your
calculation code 10 times in a loop within the same execution of the program, after loading the
data. Output the average, minimum, and maximum running time observed.
C. Now we want to see what happens when computing a more expensive function. For simplicity,
we use the following trick to simulate this with minimal programming effort: Modify the value
accumulation structure to slow down the updates by executing Fibonacci(17) whenever a
temperature is added to a station’s running sum. If this update occurs in a synchronized method
or while holding a lock, this Fibonacci evaluation should also occur in that method / with that
lock. Time all five versions exactly as in B above.
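A naive recursive Fibonacci works well for this, since the point is only to burn CPU inside the update (whether you count from Fibonacci(0)=0 or Fibonacci(1)=1 does not matter for the experiment):

```java
public class Slowdown {
    // Deliberately slow recursive Fibonacci used to simulate an expensive update.
    // Call fib(17) inside the accumulation update (and inside the lock, if one is held).
    public static long fib(int n) {
        return n <= 1 ? n : fib(n - 1) + fib(n - 2);
    }
}
```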
Sign Up For an AWS Account (Complete in Week 2)
Go to Amazon Web Services (AWS) and create an account. You should be able to find out how to do this
on your own. Enter the requested information and you are ready to go. Make sure you apply for the
education credit.
Set Up the Local Development Environment (Complete in Week 2)
We recommend using Linux for Hadoop development. (We had problems with Hadoop on Windows.) If
your computer is a Windows machine, you can run Linux in a virtual machine. We tested Oracle
VirtualBox: install VirtualBox (free) and create a virtual machine running Linux, e.g., Ubuntu (free). (If
you are using a virtual machine, then you need to apply the following steps to the virtual machine.)
Disclaimer: The following procedure is just one way for setting up your environment. You are free to use
a different approach, as long as you are able to develop, test, and then run your code on AWS. For
example, some people find the Cloudera distribution of Hadoop and Spark useful. Also consider already
installing both Hadoop and Spark, as we will use Spark in a future assignment.
Download a Hadoop 2 distribution, e.g., version 2.7.3, directly from http://hadoop.apache.org/ and
unzip it in your preferred directory, e.g., /usr/local. That’s almost all you need to do to be able to run
Hadoop code in standalone (local) mode from your IDE, e.g., Eclipse or IntelliJ. Make sure your IDE
supports development in Java. Java 1.7 and 1.8 should both work.
In your IDE, you should create a Maven project. This makes it simple to build “fat jars”, which recursively
include dependent jars used in your MapReduce program. There are many online tutorials for installing
Maven and also creating Maven projects via archetypes. These projects can be imported into your IDE or
built from a shell. The provided example pom.xml file is sufficient for this WordCount assignment.
Running Word Count (Complete in Week 2)
Find the example Word Count code in the Hadoop release you downloaded. Get it to run in the IDE on
your development machine. Notice that you will need to provide an input directory containing text files
and a path to an output directory (that directory should not exist). Once the program runs fine, look at it
closely and see what Map and Reduce are doing. Use the debugging perspective, set breakpoints in the
map and reduce functions, then experiment by stepping through the code to see how it is processing
the input file. Make sure you work with a small data sample.
To run your MapReduce program on EMR, follow the instructions in the AWS Setup Google Doc, linked
below. In short, you need to tell EMR what jar file you want to run, what its input parameters are, where
the input directory is located, and where the output should go (some location on S3). Use the input data
file for Assignment 1 at http://www.ccs.neu.edu/course/cs6240f14/. (Ignore the other data files there—
they are from previous years.) Unzip it and upload into an S3 bucket in your Amazon account. To access
the data from Elastic MapReduce, provide the bucket location, e.g., s3n://myPath/hw1.in, as the
command-line parameter for the input path of your job. Before writing any MapReduce code, take a look
at the content of the file. It is in plain text format.
If you are working with Maven and have set up the CLI for AWS, then you can deploy your job from the
command line using “make cloud” (see the Makefile we distributed). No more clicking and remembering
parameter lists for the GUI! However, you need to make sure that the variables in the beginning of the
Makefile are set according to your environment, e.g., bucket name on S3 etc. Note: it may be necessary
to run an EMR job manually one time in order for AWS to create the default security roles and to
determine your region’s subnet-id.
WARNING: Leaving large data on S3 and using EMR will cost money. Read carefully what Amazon
charges and estimate how much you would pay per hour of computation time before running a job. Use
only the small or medium machine instances. And remember to test your code as much as possible on
your development machine. When testing on AWS, first work with small data samples.
Efficient Way of Using AWS
Joe created this useful document to help you get started with an efficient industry-style setup:
https://docs.google.com/document/d/1-UjNVFasTSzhAaqLtKSmeie6JZMinhtVEqCEZwUkxeE/edit?usp=sharing
Report
Write a brief report about your findings, using the following structure.
Header
This should provide information like class number, HW number, and your name.
Weather Data Source Code
Make sure the source code for each of the program versions for the weather data analysis is easy to find
in your submitted source code files. With proper project structure and naming, this should be fairly
straightforward.
IMPORTANT: Please ensure that your code is properly documented. In particular, there should be
comments concisely explaining the role/purpose of a class, and comments that clarify what is locked.
But do not over-comment! For example, a line like “SUM += val” does not need a comment. As a rule of
thumb, you want to add a brief comment for a block of code performing some non-trivial step of the
computation. You also need to add a brief comment about the role of any major data structure you
introduce.
Weather Data Results
For each of the versions of your sequential and multithreaded program detailed in B and C, report the
minimum, average, and maximum running time observed over the 10 runs. (5 points)
Report the number of worker threads used and the speedup of the multithreaded versions based on the
corresponding average running times. (5 points)
Answer the following questions in a brief and concise manner: (4 points each)
1. Which program version (SEQ, NO-LOCK, COARSE-LOCK, FINE-LOCK, NO-SHARING) would you
normally expect to finish fastest and why? Do the experiments confirm your expectation? If not,
try to explain the reasons.
2. Which program version (SEQ, NO-LOCK, COARSE-LOCK, FINE-LOCK, NO-SHARING) would you
normally expect to finish slowest and why? Do the experiments confirm your expectation? If
not, try to explain the reasons.
3. Compare the temperature averages returned by each program version. Report if any of them is
incorrect or if any of the programs crashed because of concurrent accesses.
4. Compare the running times of SEQ and COARSE-LOCK. Try to explain why one is slower than the
other. (Make sure to consider the results of both B and C—this might support or refute a
possible hypothesis.)
5. How does the higher computation cost in part C (additional Fibonacci computation) affect the
difference between COARSE-LOCK and FINE-LOCK? Try to explain the reason.
Word Count Local Execution
Show a screenshot of your IDE window. It should contain the following information:
  • Project directory structure, showing that the WordCount.java file is somewhere in the src
    directory. (10 points)
  • The console output for a successful run of the WordCount program inside the IDE. The console
    output refers to the job summary information Hadoop produces, not the output your job emits.
    Show at least the last 20 lines of the console output. (10 points)
Word Count AWS Execution
Show a similar screenshot that provides convincing evidence of a successful run of the Word Count
program on AWS. Make sure you run the program using at least three machines, i.e., one master node
and two workers. (10 points)
Once the execution is completed, look for the corresponding log files, in particular controller and syslog,
and save them.
Deliverables
1. Source code for your sequential and parallel versions of the Java programs, including build
scripts and their configuration files (optional) and a single Makefile (optional). These files do not
need to be included if you make them available to TAs and instructor through the CCIS Github.
Make sure to tell us which repository to look for. (20 pts)
2. The report as discussed above. (1 PDF file)
3. The log files (controller and syslog) for a successful run on AWS. (plain text files) (10 points)
4. The final result file(s) produced on AWS. (plain text file) (10 points)