The amount of computation for this algorithm is small, so parallel overhead must be reduced as much as possible. This is done by using a small number of tasks T in order to reduce distribution time and avoid any false sharing of cache lines. The indirect addressing in array b is cache unfriendly if the non-zeros in A are positioned irregularly. To study the cache and memory effects we tested two extrema, both with 10 elements per row: (i) all elements are in the first 10 columns and (ii) the elements are randomly distributed over the row. So the first case is cache friendly whereas the second case is not. In Figure 2, plots of the speedup for a range of matrices are depicted for both cases, and the run times on one processor are given.

Figure 2. Speedups for CSR matrix-vector multiplication and run times on one processor.

In the table we see that up to N = 10^5 the times for MATRIX 1 and MATRIX 2 are almost the same. This means that for these values of N most accesses are from cache. The data requests for values N ≤ 10^4 are dominated by the L1 cache, while the data requests for N = 10^5 are dominated by the L2 cache. For N ≥ 10^6 the data requests are dominated by memory, which causes the decrease in performance for MATRIX 2. The speedup for MATRIX 1 increases up to N = 10^5, drops for N = 10^6, and then slightly increases again for N = 10^7. In the cases N = 10^4 and N = 10^5 the speedup is good, since on multiple processors a greater percentage of requests is satisfied by the two levels of cache than on one processor. The speedup for MATRIX 2 always increases as N gets larger. For the smaller values of N there is less reuse of the cache, so multiple processors will load the same memory segment (cache line), which yields memory conflicts and therefore a degradation of the speedup. As N gets larger there is less chance that the same memory segment has to be loaded into cache simultaneously.
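To make the indirect addressing concrete, here is a minimal serial sketch of CSR matrix-vector multiplication in C. The struct field names (val, col_ind, row_ptr) are assumptions following common CSR conventions, not the paper's actual code; the load b[col_ind[j]] is the cache-unfriendly indirect access the text describes.

```c
#include <stddef.h>

/* Minimal CSR sketch; field names (val, col_ind, row_ptr) are assumed
 * conventions, not the paper's actual data structure. */
typedef struct {
    int n;              /* number of rows */
    const double *val;  /* non-zero values */
    const int *col_ind; /* column index of each non-zero */
    const int *row_ptr; /* start of each row in val/col_ind; length n+1 */
} csr_matrix;

/* y = A*b.  The access b[col_ind[j]] is the indirect addressing the text
 * describes: when the non-zeros of a row sit in scattered columns, these
 * loads jump around b and tend to miss the cache. */
void csr_matvec(const csr_matrix *A, const double *b, double *y) {
    for (int i = 0; i < A->n; i++) {
        double sum = 0.0;
        for (int j = A->row_ptr[i]; j < A->row_ptr[i + 1]; j++)
            sum += A->val[j] * b[A->col_ind[j]];
        y[i] = sum;
    }
}
```

In case (i) of the experiment, every row's col_ind entries fall in 0..9, so the referenced part of b fits in one or two cache lines; in case (ii) they are spread over the whole vector.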
So we see that if thread 1 copies from memory to a register before thread 0 stores its result, the computation carried out by thread 0 will be overwritten by thread 1. The problem could be reversed: if thread 1 races ahead of thread 0, then its result may be overwritten by thread 0. In fact, unless one of the threads stores its result before the other thread starts reading from memory, the “winner's” result will be overwritten by the “loser.” This example illustrates a fundamental problem in shared-memory programming: when multiple threads attempt to update a shared resource (in our case, a shared variable), the result may be unpredictable. Recall that, more generally, when multiple threads attempt to access a shared resource, such as a shared variable or a shared file, at least one of the accesses is an update, and the accesses can result in an error, we have a race condition. In our example, in order for our code to produce the correct result, we need to make sure that once one of the threads starts executing the statement, it finishes executing it before the other thread starts. Therefore the code is a critical section: that is, a block of code that updates a shared resource and can only be executed by one thread at a time.
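One standard way to enforce such a critical section, sketched below with a POSIX mutex; the names (global_sum, thread_work, run_two_threads) are illustrative assumptions, not the text's actual program, and the mutex is one of several mechanisms (locks, atomics, OpenMP critical) that would serve.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared variable updated by both threads, and the mutex guarding it. */
static long global_sum = 0;
static pthread_mutex_t sum_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *thread_work(void *arg) {
    long my_result = (long)(size_t)arg;   /* stand-in for each thread's computation */
    pthread_mutex_lock(&sum_mutex);       /* enter the critical section */
    global_sum += my_result;              /* the load-add-store can no longer interleave */
    pthread_mutex_unlock(&sum_mutex);     /* leave the critical section */
    return NULL;
}

/* Run two threads that both update global_sum; with the mutex the result
 * is 1 + 2 = 3 regardless of which thread "wins" the race to run first. */
long run_two_threads(void) {
    pthread_t t0, t1;
    global_sum = 0;
    pthread_create(&t0, NULL, thread_work, (void *)(size_t)1);
    pthread_create(&t1, NULL, thread_work, (void *)(size_t)2);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return global_sum;
}
```

Without the lock/unlock pair, both threads could read the old value of global_sum before either stores, and the "winner's" update would be lost exactly as described above.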