## Description

## 1 Measuring Execution Time

In this part we will be writing a simple C program to measure the amount of

time required to compute the sum of the integers from 0 to 10000. The Linux

performance tools library allows us to accurately count the number of cycles it

takes to execute a chunk of code.

Two files are given to you, counters.h and counters.cpp, which take care of

setting up and reading the hardware performance counters.

## 1.1 C Skeleton

The following code snippet shows how to use the provided counter library. This

code shows how to initalize a hardware counter, and also how to retrieve the

current time in cycles.

1

#i n cl u d e

#i n cl u d e

#i n cl u d e

#i n cl u d e

## 1.2 Computing the Sum

Use the following code snippet to compute the sum.

i n t main ( i n t argc , cha r ∗ argv [ ] )

{

//Get the number o f the u s e r

i n t n = a t o i ( argv [ 1 ] ) ;

//Compute sum

lo ng lo ng sum = 0 ;

f o r ( lo ng lo ng i =0; i sum += i ;

}

Save this snippet to a file called sum.cpp, and compile it using:

g++ −msse4 −O2 sum . cpp −o sum

And run it using:

. / sum 10000

## 1.3 Tasks

Use what you learned from section 1.1 and 1.2 to time the execution for computing the sum from 0 to 10000 (exclusive). Remember to print out the result

of the sum after computing the elapsed time. Answer the following questions:

• With -O2 enabled, how many cycles does it take to compute the sum?

• Without -O2 enabled, how many cycles does it take to compute the sum?

Now try repeating the above experiments but without printing out the results

of the sum.

• With -O2 enabled, how many cycles does it take to compute the sum if

you don’t print the result?

• Without -O2 enabled, how many cycles does it take to compute the sum

if you don’t print the result?

In section 1.2 you might notice that we read in the number 10000 as an input

from the user. What happens if we instead hardcode the number directly into

the program like so:

//Compute sum

lo ng lo ng sum = 0 ;

f o r ( lo ng lo ng i =0; i <10000; i++)

sum += i ;

3

Now how long does it take to execute the sum?

To summarize, there are three axis to explore, whether -O2 optimization is

on or off, whether you print out the results of the sum or not, and whether you

accept the number 10000 from the user or whether it is hardcoded.

Time all 8

configurations of the above options (3×8 points) and explain the discrepancy in

the timings (2 points). Which configuration is the most accurate way of timing

the computation for the sum (2 points).

Also how does “atoi(argv[1])” work? What is “argv[1]” (1 point) and what

does “atoi()” (1 point) do?

## 2 Measuring Memory Latency

In this part we will be writing a benchmark for measuring the number of cycles

between requesting and obtaining a value from memory, also known as the

memory latency. We will use a technique called “pointer chasing” to do this.

## 2.1 Random Values

Before we explain the pointer chasing benchmark, we need to be able to create

a shuffled array. In order to do this, we need to be able to generate random

integers first. The following code snippet shows how to initialize the random

number generator and generate a random integer between 0 and 10.

#i n cl u d e

#i n cl u d e

#i n cl u d e

#i n cl u d e

## 2.2 Pointer Chasing

Assume we start with an array, A, of length N, initialized with the integers

{0 . . . N − 1} in shuffled order.

• Initialize the starting value of our index, i, to 0.

4

• In the first step, update i by setting it to the number contained in A[0].

For this example, let us say that i is now 3.

• In the second step, update i by setting it to the number contained in A[3].

Let us say that i is now 7.

• In the third step, update i by setting it to the number contained in A[7].

• Continue in this fashion for the required number of steps.

Program a pointer chasing benchmark that follows the steps outlined above.

Allow the length, N, of the array to be inputted by the user. Time how long

it takes to run the simulation for 2

20 steps, and compute the average number

of cycles to complete a single step. Remember to keep in mind the subtleties

involved in timing a computation you learned from part 1.

## 2.3 Tasks

• Explain how this pointer chasing benchmark measures the time between

requesting and obtaining a value from memory (i.e. the memory latency).

(2 points)

• How many bytes in memory is an int array of length N on the computer

you are running on? (1 point)

• Generate a plot showing the relationship between the average number

of cycles to execute a step (vertical axis) and the size in bytes of the

array (horizontal axis). Vary the size of the array from 32 kilobytes to 8

megabytes. (32 points)

• Discuss the shape of the curve. In particular, explain what the sudden

slope changes in the curve correspond to. (4 points)

• In class we discussed how modern out-of-order processors can execute

many instructions in parallel through the use of pipelining. Thus even if

a single instruction takes t nanoseconds to complete, ten instructions will

likely take much less than 10t nanoseconds to complete. Does this affect

our benchmark? That is, if it takes t cycles to run c steps in our pointer

chasing benchmark, how can we be sure that a single step is actually

completing in t

c

cycles and that the steps aren’t simply being pipelined?

(2 points)

• BONUS: We would ideally like the index, i, to eventually traverse through

the entire array. That is, we would like i to take on all the values between

0 and N eventually.

However not every array A allows this. Write a

function that when given an array A, computes how many unique values

i actually takes on. Explain how the code works. (16 points)

5

## 3 Measuring Memory Bandwidth

In this part we will be writing a benchmark for measuring the average number

of bytes that the memory system can transfer to the CPU per second. This is

known as the memory bandwidth. The two most important parameters that

characterize a memory system is its latency and its bandwidth.

## 3.1 Measuring Time in Seconds

In order to measure time in seconds, instead of in cycles as we did previously,

we need a new timing routine. The following code prints out the current time

measured in seconds.

#i n cl u d e

#i n cl u d e

## 3.2 Array Copying

In order to measure the memory bandwidth of your computer system, we are

going to time how long it takes to copy an array. Write a program that given

an array of length, N, calculates the average number of megabits copied per

second (Mbps). Remember to first “warm up the cache” by copying the array a

few times before starting timing.

## 3.3 Efficient Array Copying

We have provided two different efficient array copying procedures in “simd_copy.cpp”,

simd_memcpy and simd_memcpy_cache. These two routines use SSE intrinsics, to maximize the performance. Alter your program to be able to use any of

the three array copying procedures to time how long it takes to copy an array

of length, N.

## 3.4 Tasks

• Plot the relationship between the average bandwidth (in Mbps) sustained

by the memory system and the total size (in bytes) of the array. Produce a plot using each of the three copy routines, your own routine,

6

simd_memcpy and simd_memcpy_cache. Vary the size of the array from

32 kilobytes to 8 megabytes. (3 × 8 points)

• Discuss the shape of the curve. Explain any sudden increases in bandwidth

as the array size varies. (4 points)

• Explain why it is necessary to first “warm up the cache” by copying the

array a few times before starting timing. (2 points)

• We are trying to measure the maximum bandwidth that the memory system can sustain. Explain why an inefficient array copying procedure would

yield an inaccurate measure of the bandwidth. (2 points)

• The simd copying procedures use SSE intrinsics to improve the performance. Intel and AMD processors come with a special set of instructions,

called SSE instructions, for directly controlling the simd functional units.

SSE intrinsics are a pleasant C wrapper for these instructions, so that the

user does not need to write assembly code to gain access to this functionality. Find all the different SSE intrinsics calls (they start with “_mm_”)

and explain what each of them do. The “Intel Intrinsics Guide” is a good

resource. (2 × 3 points)

### 4 Measuring Flops and IPC

In this part we will be writing a benchmark for measuring the number of floating point operations completed per second (Flops) of matrix multiply and the

number of instructions completed per cycle (IPC). Flops is an important performance measure for numerical code.

### 4.1 Measuring Instructions

The provided counter library can measure the number of instructions executed

in a chunk of code in a similar manner to how we measured the number of cycles

elapsed.

// I n i t i a l i z e i n s t r u c t i o n co u n t e r

hwCounter_t c ;

c . i n i t = f a l s e ;

i n i t I n s n s ( c ) ;

//Get c u r r e n t i n s t r u c t i o n count

uint64_t count = g e t I n s n s ( c ) ;

. . . . do some computation . . . .

//Compute number o f i n s t r u c t i o n s e xe cu te d

uint64_t e x ec u t ed = g e t I n s n s ( c ) − count ;

7

We will be using this feature to measure the IPC of four different implementations of matrix multiply.

### 4.2 Matrix Multiply

We are providing a binary file, square_sgemm.o, that contains four different implementations of square matrix multiply (opt_simd_sgemm, opt_scalar1_sgemm,

opt_scalar0_sgemm, and naive_sgemm), and a skeleton file, mmultiply.cpp,

that shows how to call them.

Here is the prototype for naive_sgemm:

void naive_sgemm ( f l o a t ∗Y, f l o a t ∗A, f l o a t ∗B, i n t n ) ;

Naive_sgemm takes two square matrices, A and B, and their width, n, as input

and stores the result of multiplying A and B in Y. The other implementations

have identical signatures.

Write a program that computes the average number of floating point operations completed per second (Flops) and the number of instructions completed

per cycle (IPC) for multiplying matrices of size 2

10

.

### 4.3 Tasks

• Produce a table recording the Flops and IPC of each matrix multiply

implementation. (6 × 4 points)

• How many floating point operations (in this case, multiply and add) are

performed to multiply two square matrices of size n? (3 points)

• Compute the Flops by dividing the total number of floating point operations performed by the time in seconds required.

• Some of the implementations should have measured IPC exceeding 1, indicating that the processor is completing more than one instruction per

cycle. How can this be? (2 points)

• Comparing opt_scalar1_sgemm with opt_simd_sgemm, opt_simd_sgemm

should have a lower measured IPC than opt_scalar1_sgemm even though

it performs nearly eight times more floating point operations per second.

How can this be? Isn’t a higher IPC always an indicator of an efficient

program? (2 points)

8