CMSE/CSE 822 Homework 7

Parallel Computing
GPU-based Cardiac Electrophysiology Simulation¹ [100 points]
¹ Credit: Simula Research Lab (Norway), Scott Baden (LBNL), Didem Unat (Koc Univ.)
Simulations play an important role in science, medicine, and engineering. For example, a cardiac
electrophysiology simulator can be used for clinical diagnostic and therapeutic purposes. Cell simulators
entail solving a coupled set of equations: A system of Ordinary Differential Equations (ODEs) together
with Partial Differential Equations (PDEs). In this assignment, you’ll implement the Aliev-Panfilov heart
electrophysiology simulator, which is a 5-point stencil method and requires the solution of a PDE and two
related ODEs. We will be focusing not on the underlying numerics, but on the efficiency aspects of the
simulation.
The simulator models the propagation of electrical signals in the heart, and it incorporates a cell model
describing the kinetics of the membrane of a single cell. The PDE couples multiple cells into a system.
There can be different cell models, with varying degrees of complexity. We will use a model known as the
Aliev-Panfilov model, which maintains 2 state variables and solves a single PDE. This simple model can
account for complex behavior such as how spiral waves break up and form elaborate patterns. Spiral waves
can lead to life-threatening situations such as ventricular fibrillation (a medical condition in which the heart
muscle twitches randomly rather than contracting in a coordinated fashion).
The simulator models electrical signals in an idealized system in which voltages vary at discrete points in
time, called timesteps, on discrete positions of a mesh of points. For simplicity, we’ll use a uniformly spaced
mesh (irregular meshes are also possible, but are more difficult to parallelize). At each timestep, the
simulator updates the voltage at every mesh point using its nearest neighbors in space and time: first the
spatial update, then the temporal one. The nearest neighbors in space are the points to the north, south,
east, and west (i.e., a 5-point stencil).
In general, the finer the mesh (and hence the more numerous the points), the more accurate the solution,
but at an increased computational cost. Likewise, the smaller the timestep, the more accurate the solution,
but at a higher computational cost. To simplify your performance analysis, you can run your simulations
for a limited number of iterations on relatively small grids – actual simulations would require a very large
number of timesteps on large grids, and hence take several days.
Simulator
The simulator keeps track of two state variables that characterize the electrophysiology that is being
simulated. Both state variables are represented as 2D arrays. The first variable, called the excitation, is
stored in the E[ ][ ] matrix. The second variable, called the recovery variable, is stored in the R[ ][ ] matrix.
Lastly, we store Eprev, the voltage at the previous timestep, which is needed to advance the voltage over
time.
Since we are using the method of finite differences to solve the problem, the variables E and R are
discretized by considering the values only at a regularly spaced set of discrete points. The formula for
solving the PDE, where E and Eprev refer to the voltage at current and previous timestep, respectively, and
α is a constant defined in the code, is as follows:

E[i,j] = Eprev[i,j] + alpha*(Eprev[i+1,j] + Eprev[i-1,j] +
Eprev[i,j+1] + Eprev[i,j-1] - 4*Eprev[i,j])
The formula for solving the ODE is shown below. Here kk, a, b, epsilon, M1 and M2 are again constants
defined in the simulator and dt is the time step size.

E[i,j] = E[i,j] - dt*( kk * E[i,j] * (E[i,j] - a) * (E[i,j] - 1) +
E[i,j] * R[i,j])

R[i,j] = R[i,j] + dt*(epsilon + M1*R[i,j] / (E[i,j] + M2)) *
(-R[i,j] - kk * E[i,j] * (E[i,j]-b-1))
Serial Code
You are given a working serial simulator that uses the Aliev-Panfilov cell model described above.
Command line options for the cardiac electrophysiology simulator are as follows:
-t Duration of simulation
-n Number of mesh points in the x and y dimensions
-p Plot the solution as the simulator runs, at regular intervals
-x x-axis of the processor geometry (valid only for MPI implementation)
-y y-axis of the processor geometry (valid only for MPI implementation)
-k Disable MPI communication
-o Number of OpenMP threads per process
For convenience, the simulator includes a plotting capability (this requires the “-X” argument when
connecting to HPCC through ssh, to forward the application display to your local machine, and loading
gnuplot afterwards), which you can use to debug your code and to observe the simulation dynamics. The
plot frequency can be adjusted from the command line. However, your timing results should be measured
with plotting disabled.
Example command line

module load CUDA/10.0.130 gnuplot
make
./cardiacsim -n 400 -t 1000 -p 100

The example will execute on a 400 x 400 box, and run to 1000 units of simulated time, plotting the evolving
solution every 100 units of simulated time. For this assignment, you do not have to worry about options -x,
-y, -k, or -o.
You’ll notice that the solution arrays are allocated to be 2 larger than the domain size (n+2). The boundaries
are padded with a cell on each side in order to properly handle ghost cell updates using the mirroring
technique. See Lecture 16 for details about handling ghost cells.
Assignment
You will parallelize the cardiac electrophysiology simulator using a single GPU. Starting with the serial
implementation provided, we ask you to implement 4 different versions using CUDA:
Version 1: Implement a naive GPU parallel simulator that creates a separate kernel for each for-loop in the
simulate function. These kernels will all make references to global memory. Make sure that your naive
version works correctly before you implement the other three options. In particular, check if all data
allocations on the GPU, data transfers and synchronizations are implemented correctly.
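For concreteness, a naive PDE kernel and its host-side launch might look like the sketch below. The kernel name, the flattened row-major layout of the padded (n+2) x (n+2) arrays, and the 16 x 16 launch geometry are illustrative assumptions, not part of the provided code.

#include <cuda_runtime.h>

// One thread per interior mesh point; the +1 offsets skip the padding layer.
__global__ void pde_kernel(double *E, const double *Eprev, double alpha, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i <= n && j <= n) {
        int idx = i * (n + 2) + j;   // row-major index into the padded array
        E[idx] = Eprev[idx] + alpha * (Eprev[idx + 1] + Eprev[idx - 1]
               + Eprev[idx + (n + 2)] + Eprev[idx - (n + 2)] - 4 * Eprev[idx]);
    }
}

void launch_pde(double *d_E, double *d_Eprev, double alpha, int n)
{
    dim3 block(16, 16);   // the block size is a tuning parameter
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    pde_kernel<<<grid, block>>>(d_E, d_Eprev, alpha, n);
    cudaDeviceSynchronize();   // not needed between same-stream kernels, but useful when checking for errors
}

Kernels with the same index arithmetic would apply the two ODE formulas to E and R.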
Version 2: Fuse all kernels into a single kernel. Note that you can fuse the ODE and PDE loops into a
single loop, thus into a single kernel.
Version 3: Use temporary variables to eliminate global memory references for R and E arrays in the ODEs.
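Putting Versions 2 and 3 together might look like the sketch below: the PDE and ODE updates are fused into a single kernel, and the ODE arithmetic runs on register temporaries so that E and R are each read from and written to global memory only once per point. Names and layout are the same illustrative assumptions as in the previous sketch.

__global__ void fused_kernel(double *E, const double *Eprev, double *R,
                             double alpha, double dt, double kk, double a,
                             double b, double epsilon, double M1, double M2,
                             int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int i = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i > n || j > n) return;
    int idx = i * (n + 2) + j;

    // Version 2: the PDE and ODE loops become one per-point update.
    double e = Eprev[idx] + alpha * (Eprev[idx + 1] + Eprev[idx - 1]
             + Eprev[idx + (n + 2)] + Eprev[idx - (n + 2)] - 4 * Eprev[idx]);

    // Version 3: temporaries e and r replace repeated global references.
    double r = R[idx];
    e = e - dt * (kk * e * (e - a) * (e - 1) + e * r);
    r = r + dt * (epsilon + M1 * r / (e + M2)) * (-r - kk * e * (e - b - 1));

    E[idx] = e;   // each array is written back exactly once
    R[idx] = r;
}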
Version 4: Optimize your CUDA implementation by using shared memory (on-chip memory) on the GPU
by bringing a 2D block into shared memory and sharing it with multiple threads in the same thread block.
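One possible shape for the shared-memory version is sketched below. Each thread block stages a (BDIM+2) x (BDIM+2) tile of Eprev, including a one-cell halo, into shared memory, and the stencil then reads only the tile. To keep the sketch short, it assumes n is a multiple of BDIM and shows only the PDE part; the ODE update on temporaries carries over unchanged from Version 3.

#define BDIM 16   // assumed thread-block edge; another tuning parameter

__global__ void shared_kernel(double *E, const double *Eprev, double alpha, int n)
{
    __shared__ double tile[BDIM + 2][BDIM + 2];
    int tx = threadIdx.x, ty = threadIdx.y;
    int j = blockIdx.x * BDIM + tx + 1;   // global coordinates of this point
    int i = blockIdx.y * BDIM + ty + 1;
    int idx = i * (n + 2) + j;

    // Stage the tile: every thread loads its own point, and threads on a
    // tile edge also load the halo cell next to them. The halo loads stay
    // in bounds because the global arrays carry the same one-cell padding.
    tile[ty + 1][tx + 1] = Eprev[idx];
    if (tx == 0)        tile[ty + 1][0]        = Eprev[idx - 1];
    if (tx == BDIM - 1) tile[ty + 1][BDIM + 1] = Eprev[idx + 1];
    if (ty == 0)        tile[0][tx + 1]        = Eprev[idx - (n + 2)];
    if (ty == BDIM - 1) tile[BDIM + 1][tx + 1] = Eprev[idx + (n + 2)];
    __syncthreads();   // the whole tile must be staged before anyone reads it

    // The 5-point stencil is now served entirely from on-chip memory.
    E[idx] = tile[ty + 1][tx + 1]
           + alpha * (tile[ty + 1][tx + 2] + tile[ty + 1][tx]
                    + tile[ty + 2][tx + 1] + tile[ty][tx + 1]
                    - 4 * tile[ty + 1][tx + 1]);
}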
You should implement these optimizations on top of each other (e.g. Version 3 should be implemented on
top of Version 2). Details of how to implement these optimizations will be discussed in class on Tuesday,
December 5th. Note that these optimizations do not guarantee performance improvement. Implement
them and observe how they affect the performance.
Ghost Cells
Since all CUDA threads share global memory, there is no need to exchange ghost cells between thread
blocks. However, the physical boundaries of the domain need to be updated every iteration using mirror
boundary updates. You can do this in a series of separate kernel launches on the GPU, which will be a bit
costly, or embed mirror boundary updates into a single kernel launch.
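A minimal sketch of a boundary kernel is shown below, assuming the same mirroring scheme as the serial code, in which each ghost cell receives the value two cells inside the boundary; the kernel name and the 1D launch are illustrative.

// One thread per boundary index k = 1..n; each thread mirrors one ghost
// cell on each of the four sides of the domain.
__global__ void mirror_kernel(double *Eprev, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int w = n + 2;   // padded row width
    if (k <= n) {
        Eprev[k * w]           = Eprev[k * w + 2];         // west ghost column
        Eprev[k * w + n + 1]   = Eprev[k * w + n - 1];     // east ghost column
        Eprev[k]               = Eprev[2 * w + k];         // north ghost row
        Eprev[(n + 1) * w + k] = Eprev[(n - 1) * w + k];   // south ghost row
    }
}

Launched as, for example, mirror_kernel<<<(n + 255) / 256, 256>>>(d_Eprev, n) before the simulation kernel in each timestep; alternatively, the same four assignments can be folded into the simulation kernel itself to save a launch.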
Data transfers from CPU to GPU
Note that it is enough to transfer the simulation data from CPU to GPU only once before the simulation
iterations start (before the while(t<T) loop in the main()). During the course of the simulation, E, R, and
E_prev will be updated and used on the GPU only. The only exception is plotting: when a plot is due, you
will need to copy the E matrix from the GPU back to the CPU.
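In host code this might look like the sketch below; the variable names, the plotting condition, and the omitted error handling are illustrative. Note that advancing to the next timestep only requires swapping the E and Eprev device pointers on the host; no copy is needed.

size_t bytes = (size_t)(n + 2) * (n + 2) * sizeof(double);
double *d_E, *d_Eprev, *d_R;
cudaMalloc(&d_E, bytes);
cudaMalloc(&d_Eprev, bytes);
cudaMalloc(&d_R, bytes);

// One-time host-to-device transfer, before the timestep loop.
cudaMemcpy(d_E,     h_E,     bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_Eprev, h_Eprev, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_R,     h_R,     bytes, cudaMemcpyHostToDevice);

while (t < T) {
    t += dt;
    // ... launch the mirror-boundary and simulation kernels here ...
    if (plot_due)   // copy E back only when a plot is actually generated
        cudaMemcpy(h_E, d_E, bytes, cudaMemcpyDeviceToHost);
    double *tmp = d_E; d_E = d_Eprev; d_Eprev = tmp;   // pointer swap, no copy
}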
Reporting Performance
• Use t=100 for your performance studies, and experiment with different grid sizes such as n=200, 1000
and 5000. Measure performance without having the plotter on.
• Recall that data transfer time affects the performance. When you measure the execution time and Gflop/s
rates, do not include the data transfer time (a timing sketch using CUDA events is given after this list).
• Tune the block size for your implementation and draw a performance chart for various block sizes.
• Compare the performance of your best implementation with the CPU version (serial).
• Document your findings in your write-up.
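CUDA events are one way to time only the simulation region; below is a sketch that assumes the one-time transfers happen outside the timed span.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);   // after the one-time host-to-device copies
// ... the entire while (t < T) simulation loop ...
cudaEventRecord(stop);

cudaEventSynchronize(stop);   // wait until all timed GPU work has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// Gflop/s = total flops performed by the loop / (ms * 1e6)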
Programming Environment
You will use the Nvidia K80 GPUs on HPCC’s Intel16 cluster.
https://wiki.hpcc.msu.edu/display/hpccdocs/GPU+Computing
Since you will run your simulations on small problems for a small number of iterations, your execution
times will be short. Therefore, you can use the dev-intel16-k80 nodes for both development and performance
testing. Note, however, that everyone in the class will be using the two GPUs located on the same node,
so make sure to give yourself enough time to develop and test your program before the project deadline.
Grading
Your grade will depend on two factors: the correctness of your implementations and the depth and clarity
of your report. Document your work well, discuss your findings, and offer insights into your results and plots.
Implementation (75 points): Versions 1, 2, and 4 are worth 20 points each; Version 3 is worth 15 points.
Report (25 points): Implementation description and performance analysis.
Instructions
• Cloning git repo. You can clone the skeleton codes through the git repo. Please refer to “Homework
Instructions” under the “Reference Material” section on D2L.
• Discussions. For your questions about the homework, please use the Slack group for the class so that
answers by the TA, myself (or one of your classmates) can be seen by others.
• Submission. Your submission will include:
o A pdf file named “HW7_yourMSUNetID.pdf”, which contains your answers to the non-implementation questions, and the report and interpretation of your performance results.
o Source file used to generate the reported performance results. Make sure to use the exact file
name listed below:
§ cardiacsim.cu
It is essential that you do not change the directory structure or file names.
To submit your work, please follow the directions given in the “Homework Instructions” under the
“Reference Material” section on D2L. Make sure to strictly follow these instructions; otherwise you
may not receive proper credit.
• Compilation and execution. You are provided a Makefile for compiling your code. Simply type
“make” in the directory where the Makefile is located.