CMSE/CSE 822 Homework 5
Parallel Computing
1) [40 pts] MPI point-to-point vs collectives
Consider the trapezoidal rule for numerical integration (see relevant course slides and/or the textbook).
a) Implement two distributed memory parallel versions of the trapezoidal rule: i) using point-to-point
communications, and ii) using collective communications. See the skeleton code and comments
provided in there for guidance.
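As a starting point, here is a minimal sketch of the two communication patterns, assuming the usual decomposition where each rank integrates its own subinterval. The integrand f, the bounds, and the trap() helper are illustrative placeholders, not the skeleton's actual names:

#include <mpi.h>
#include <stdio.h>

/* Illustrative integrand; the skeleton defines its own. */
double f(double x) { return x * x; }

/* Serial trapezoidal rule on [left, right] with n subintervals of width h. */
double trap(double left, double right, long n, double h) {
    double sum = (f(left) + f(right)) / 2.0;
    for (long i = 1; i < n; i++)
        sum += f(left + i * h);
    return sum * h;
}

int main(int argc, char *argv[]) {
    int rank, size;
    double a = 0.0, b = 1.0;   /* global interval (illustrative) */
    long n = 1000000;          /* total number of trapezoids     */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double h = (b - a) / n;
    long local_n = n / size;   /* assumes size divides n evenly  */
    double local_a = a + rank * local_n * h;
    double local_b = local_a + local_n * h;
    double local = trap(local_a, local_b, local_n, h);

    /* (i) Point-to-point: every rank sends to rank 0, which accumulates. */
    if (rank != 0) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = local, piece;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&piece, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += piece;
        }
        printf("p2p result: %.12f\n", total);
    }

    /* (ii) Collective: a single reduction replaces the receive loop above. */
    double total2;
    MPI_Reduce(&local, &total2, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("collective result: %.12f\n", total2);

    MPI_Finalize();
    return 0;
}

In your actual implementation the two versions should be timed separately, since part b asks you to compare them.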
b) Time your implementations of the trapezoidal rule using various numbers of sampling points (e.g.,
n = 1000, 100,000, or 10,000,000) and MPI processes (e.g., p = 1, 2, 5, 10, 20, 100, 250). How does the
performance of your point-to-point and collective communication versions compare? What are the
speedups and efficiencies? Based on the data you collected, what can you say about the scalability
of your implementation? Note: per HPCC policy, the maximum number of cores that may be
allocated at once is capped at 512.
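For reference when reporting results (standard definitions, not specific to this assignment): speedup is S(p) = T_serial / T_parallel(p) and efficiency is E(p) = S(p) / p; ideal (linear) scaling corresponds to S(p) = p, i.e., E(p) = 1.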
c) Although we don’t know the internals of the implementation of MPI collectives, we might guess
that it uses a structure similar to the binary tree we discussed. If this is the case, we would expect
that its run-time would grow roughly at the rate of log2(p), since there are roughly log2(p) levels in
the tree. (Here, p = comm_sz.) Since the run-time of the serial trapezoidal rule is roughly
proportional to n, the number of trapezoids, and the parallel trapezoidal rule simply applies the
serial rule to n/p trapezoids on each process, we get a formula for the overall run-time of the parallel
trapezoidal rule that looks like

T_parallel(n, p) = a * (n / p) + b * log2(p)

for some constants a and b. Use the formula, the timing measurements you've taken above, and
your favorite program for doing mathematical calculations (e.g., MATLAB) to get a least-squares
estimate of the values of a and b. Comment on the quality of the predicted run-times using the
formula and the values of a and b you calculated.
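The fit itself is a two-parameter linear least-squares problem, so any tool works. If you would rather stay in C than use MATLAB, a minimal sketch using the normal equations follows; the timing arrays below are placeholders that you must replace with your own measurements (compile with, e.g., cc fit.c -lm):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Placeholder data: replace with your own (n, p, measured time) triples. */
    double n[] = {1000, 1000, 100000, 100000, 10000000, 10000000};
    double p[] = {1,    10,   1,      10,     1,        10};
    double t[] = {0, 0, 0, 0, 0, 0};   /* your measured run-times, in seconds */
    int m = 6;

    /* Model: T = a*(n/p) + b*log2(p).  Build the 2x2 normal equations
       for the unknowns a and b, then solve by Cramer's rule. */
    double s11 = 0, s12 = 0, s22 = 0, r1 = 0, r2 = 0;
    for (int i = 0; i < m; i++) {
        double x1 = n[i] / p[i];
        double x2 = log2(p[i]);
        s11 += x1 * x1;  s12 += x1 * x2;  s22 += x2 * x2;
        r1  += x1 * t[i];  r2 += x2 * t[i];
    }
    double det = s11 * s22 - s12 * s12;
    double a = (r1 * s22 - r2 * s12) / det;
    double b = (s11 * r2 - s12 * r1) / det;
    printf("a = %g, b = %g\n", a, b);
    return 0;
}

Note that the data set must include runs with p > 1; if every measurement has p = 1, the log2(p) column is identically zero and the system is singular.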
2) [60 pts] Code Breaking
Encryption (and how to crack it) is one of the most fundamental concepts in cybersecurity. The Data
Encryption Standard (DES) is a well-known encryption method that uses a 56-bit secret key for encrypting
and decrypting data. Stronger standards with much longer key lengths, such as the Advanced Encryption
Standard (AES), have been gradually replacing DES since 2001.
In this problem, you will work on brute-force cracking of encrypted messages using MPI. While DES
encryption can be broken in a matter of days using resources at HPCC, this is certainly not practical for our
purposes. For this reason, we will look at a simple 32-bit encryption algorithm that relies on XOR operations
along with bit rotations. The file encrypter.c implements this 32-bit encryption algorithm sequentially.
Similarly, in the file codebreaker.c, you are given a sequential code-breaking algorithm which simply tries all
2^32 keys in a brute-force manner to decrypt a file encrypted by the code in encrypter.c. While searching over
the keys, each candidate key is validated by comparing the words in the decrypted message
against those in our small dictionary.
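To make the flavor of such a cipher concrete, here is a purely illustrative XOR-plus-rotation routine. It is not the algorithm in encrypter.c, which is the one your code must actually match and may differ in every detail:

#include <stdint.h>

/* Left-rotate a 32-bit word by r bits (r in 0..31). */
static uint32_t rotl32(uint32_t x, unsigned r) {
    return (x << r) | (x >> ((32 - r) & 31));
}

/* Illustrative only: XOR each 32-bit block with a key that is
   rotated between blocks.  XORing twice with the same keystream
   restores the plaintext, so decryption reuses this routine. */
void toy_crypt(uint32_t *buf, int nwords, uint32_t key) {
    for (int i = 0; i < nwords; i++) {
        buf[i] ^= key;
        key = rotl32(key, 1);   /* vary the key from block to block */
    }
}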
Your task is to parallelize the brute-force algorithm given in the codebreaker.c file in a few different ways and
analyze the performance of the resulting implementations. There are two general possibilities for partitioning
this problem:
• Static partitioning, where the division of the workload is done a priori based on a predetermined
mechanism (a minimal sketch follows this list),
• Dynamic partitioning, where the division and assignment of the workload is done at run-time, permitting
load balancing to be performed, albeit at the cost of a more complicated solution.
We will only be looking at static partitioning techniques.
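For instance, a minimal sketch of the static block partitioning asked for in part (a) below, assuming the full key space 0..2^32-1 is split into one contiguous block per rank (variable names are illustrative):

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the 2^32 keys into contiguous blocks; the first
       (2^32 mod size) ranks take one extra key each so the whole
       space is covered even when size does not divide 2^32. */
    uint64_t total = 1ULL << 32;
    uint64_t base  = total / size;
    uint64_t extra = total % size;
    uint64_t r     = (uint64_t)rank;
    uint64_t first = r * base + (r < extra ? r : extra);
    uint64_t count = base + (r < extra ? 1 : 0);

    printf("rank %d tries keys [%llu, %llu)\n", rank,
           (unsigned long long)first, (unsigned long long)(first + count));

    /* for (uint64_t k = first; k < first + count; k++) { try key k ... } */

    MPI_Finalize();
    return 0;
}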
a) Parallelize codebreaker.c using block partitioning of the set of possible keys. Your code should
follow this outline:
i) The root process should read the encrypted message and broadcast it to all processes,
ii) Based on the total number of processes, each process should determine its own set of trial keys,
iii) When a process discovers the right key, it should print the success message and write the
decrypted message into a file. In addition, it should notify all other processes so that the
MPI program can be terminated immediately and correctly. Hint: This can be achieved by
each process posting an immediate receive operation before trying its own keys. Then
the process that discovers the correct key sends notifications to all others about program
completion (a sketch of this pattern follows the outline),
iv) Do not change the dictionary, the program output message, and/or the output file name.
Your program will be tested by providing a set of random (but properly encrypted)
messages (of length < 1000).
Hint: You may find the commented-out printf statements useful in understanding or debugging your
parallel codes. Make sure to comment out any printf statements that you uncomment before
submission. For initial testing, run your parallel codes with a single process only, and try to decrypt
messages encrypted with small key values (e.g., key = 5, 10, 100).
b) Analyze the scalability of your implementation in part a on a single node of the intel16 cluster. Note
that, as opposed to regular problems, search algorithms can achieve super-linear, linear, or no speedup
compared to a sequential program (but the expected speedup will still be linear). Demonstrate each of
these possible scenarios by carefully choosing the encryption key and the number of processes (make
sure to include examples using up to all 28 processes on a single intel16 cluster node).
c) Find and implement a partitioning strategy that overcomes the zero (or sublinear) speedup pitfall of
the implementation in part a. Test your new code (version 2) on your examples from part b to show
that your new partitioning strategy does indeed overcome the no (or sublinear) speedup pitfall.

BONUS: [20 pts] MPI/OpenMP parallelization
On a multi-core architecture, good use of system resources can be achieved by first starting a single process
for each physical socket in the system (on intel16, there are two sockets (or chips) per node), and then
spawning as many threads as the number of cores per socket to fully exploit the multiple cores in that
system. This is called MPI/OpenMP hybrid parallelization (obviously, when the threads created are OpenMP
threads).
• Implement an MPI/OpenMP parallel version of the numerical integration code that you developed in
problem 1.
• Analyze the performance of the MPI/OpenMP parallel code with respect to the MPI-only parallel
version. Do you observe better, worse, or similar performance? Why do you think this is the case? (A
sketch of the hybrid structure appears at the end of this handout.)

Instructions:
• Cloning the git repo. You can clone the skeleton codes through the git repo. Please refer to "Homework
Instructions" under the "Reference Material" section on D2L.
• Submission. Your submission will include:
o A pdf file named "HW5_yourMSUNetID.pdf", which contains your answers to the
non-implementation questions, along with the report and interpretation of performance results.
o All source files used to generate the reported performance results. Make sure to use the exact file
names listed below:
§ trapezoid.c (for problem 1)
§ codebreaker.c (for problem 2)
§ codebreaker_v2.c (for problem 2)
§ trapezoid_hybrid.c (for the bonus problem)
These are indeed the default names in the git repository, so do not change the directory structure or
file names. To submit your work, please follow the directions given in the "Homework Instructions"
under the "Reference Material" section on D2L. Make sure to strictly follow these instructions;
otherwise you may not receive proper credit.
• Discussions. For your questions about the homework, please use the Slack discussion forum for the
class so that answers by the TA, myself, or one of your classmates can be seen by others.
• Compilation and execution. You can use any compiler with MPI (and OpenMP) support. The default
compiler environment at HPCC is GNU. You can compile an MPI parallel code as follows:
mpicc codebreaker.c -o decrypter
The resulting MPI parallel binary must be executed through an MPI launcher; on HPCC, use srun (for
example, with 4 processes):
srun -n 4 ./decrypter encrypted_filename
• For hybrid MPI/OpenMP compilation, you need to provide the -fopenmp flag to the mpicc compiler.
Remember to set the OMP_NUM_THREADS environment variable when necessary.
• Measuring your execution time properly. The MPI_Wtime() or omp_get_wtime() functions will allow
you to measure the timing for a particular part of your program (see the skeleton codes). Make sure to
collect 3-5 measurements and take their average when reporting a performance data point.
• Executing your jobs. You should develop, compile, and test your programs on the dev nodes at HPCC.
However, on the dev nodes there will be several other programs running simultaneously, and your
measurements will not be accurate. After you make sure that your program is bug-free and executes
correctly on the dev nodes, the way to get good performance data for different programs and various
input sizes is to use the interactive or batch execution modes. Please consult HPCC's wiki pages for
details (https://wiki.hpcc.msu.edu/display/ITH/Job+Scheduling+by+SLURM). Note that jobs may wait
in the queue for a few hours on a busy day, so plan accordingly and do not wait until the last day.
• Batch job script. Batch jobs are convenient, especially if you would like to collect data for several runs
(this may still take a few hours to complete, but at least you do not have to sit in front of the computer).
Note that you can execute several runs of your programs with different input values in the same job
script; this way you can avoid submitting and tracking several jobs. An example job script:

#!/bin/bash --login
# SBATCH resource allocation directives go here (e.g., #SBATCH --ntasks=28)

# change to the working directory where your code is located
cd ~/cse822-${USER}/homework/5

# call your executable with different numbers of processes/threads
srun -n 1 ./decrypter inp1.txt
srun -n 14 ./decrypter inp1.txt
srun -n 28 ./decrypter inp1.txt
# if necessary, list more runs here...

# all output will be written to the default job stdout stream.
# if you are trying to debug your code, make sure to print to the
# stderr stream, e.g., using fprintf(stderr, "...", ...);
# If desired, you can redirect individual results to a file, e.g.:
# srun -n 1 ./decrypter inp1.txt >& decrypt_p1.out
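Finally, a minimal sketch of the hybrid structure described in the bonus problem, combining problem 1's decomposition across MPI ranks with an OpenMP reduction inside each rank. The integrand, bounds, and trapezoid count are illustrative; MPI_THREAD_FUNNELED suffices here because only the main thread makes MPI calls:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

double f(double x) { return x * x; }    /* illustrative integrand */

int main(int argc, char *argv[]) {
    int rank, size, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double a = 0.0, b = 1.0;
    long n = 10000000;                   /* total trapezoids (illustrative) */
    double h = (b - a) / n;
    long local_n = n / size;             /* assumes size divides n evenly */
    double local_a = a + rank * local_n * h;

    /* Each rank's subinterval is summed in parallel by its OpenMP threads. */
    double local = (f(local_a) + f(local_a + local_n * h)) / 2.0;
    #pragma omp parallel for reduction(+:local)
    for (long i = 1; i < local_n; i++)
        local += f(local_a + i * h);
    local *= h;

    double total;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("hybrid result: %.12f\n", total);

    MPI_Finalize();
    return 0;
}

With two 14-core sockets per intel16 node, the intended configuration would be along the lines of OMP_NUM_THREADS=14 srun -n 2 ./trapezoid_hybrid (exact SLURM flags per your job setup).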