Parallel Random Access Machine (PRAM)

image-20210624194946878

To resolve access conflicts

  • Exclusive-Read Exclusive-Write (EREW)
  • Concurrent-Read Exclusive-Write (CREW)
  • Concurrent-Read Concurrent-Write (CRCW)
    • Arbitrary rule
    • Priority rule (P with the smallest number)
    • Common rule (if all the processors are trying to write the same value)

The summation problem

  • Input: A(1),A(2),,A(n)A(1), A(2), \cdots, A(n)
  • Output: $ A(1) + A(2) + \cdots +A(n)$

image-20210624195749289

B(h,i)=B(h1,2i1)+B(h1,2i)B(h, i) = B(h-1, 2i-1) + B(h-1, 2i)

1
2
3
4
5
6
7
for Pi, 1 <= i <= n pardo
B(0, i) := A( i )
for h = 1 to log n do
if i <= n/2^h
B(h, i) := B(h-1, 2i-1) + B(h-1, 2i)
else stay idle
for i = 1: output B(log n, 1); for i > 1: stay idle

T(n)=logn+2T(n)=\log n+2

Disadvantages

  • Does not reveal how the algorithm will run on PRAMs with different number of processors
  • Fully specifying the allocation of instructions to processors requires a level of detail which might be unnecessary

Work-Depth (WD) Presentation

1
2
3
4
5
6
7
for Pi, 1 <= i <= n pardo
B(0, i) := A( i )
for h = 1 to log n
for Pi, 1 <= i <= n/2^h pardo
B(h, i) := B(h-1, 2i-1) + B(h-1, 2i)
for i = 1 pardo
output B(log n, 1)
image-20210624202446180

Measuring the performance

  • Work load – total number of operations: W(n)W(n)
  • Worst-case running time: T(n)T(n)
  • All asymptotically equivalent
    • W(n)W(n) operations and T(n)T(n) time
    • P(n)=W(n)/T(n)P(n) = W(n)/T(n) processors and T(n)T(n) time (on a PRAM)
    • W(n)/pW(n)/p time using any number of pW(n)/T(n)p \leq W(n)/T(n) processors (on a PRAM)
    • W(n)/p+T(n)W(n)/p + T(n) time using any number of p processors (on a PRAM)
1
2
3
4
5
6
7
for Pi, 1 <= i <= n pardo
B(0, i) := A( i )
for h = 1 to log n
for Pi, 1 <= i <= n/2^h pardo
B(h, i) := B(h-1, 2i-1) + B(h-1, 2i)
for i = 1 pardo
output B(log n, 1)

T(n)=logn+2W(n)=n+n/2+n/22++n/2kwhere 2k=n=2nT(n)=\log n+2\\ W(n)=n+n/2+n/2^2+\cdots+n/2^k\quad\text{where $2^k=n$}\\ =2n

WD-presentation Sufficiency Theorem

An algorithm in the WD mode can be implemented by any P(n)P(n) processors within O(W(n)/P(n)+T(n))O(W(n)/P(n) + T(n)) time, using the same concurrent-write convention as in the WD presentation.

Prefix-Sums

  • Input: A(1),A(2),,A(n)A(1),A(2),\cdots,A(n)
  • Output: i=11A(i),i=12A(i),,i=1nA(i)\sum^1_{i=1}A(i),\sum^2_{i=1}A(i),\cdots,\sum^n_{i=1}A(i)
  • Technique: Balanced Binary Trees

image-20210624205107333

image-20210624205154156

  • C(h,i)=k=1αA(k)C(h,i)=\sum^\alpha_{k=1}A(k) where (0,α)(0,\alpha) is the rightmost descendant leaf of node (h,i)(h, i)
  • if ( i==1 ) C(h, i) := B(h, i)
  • if ( i%2 == 0 ) C(h, i) := C(h+1, i/2)
  • if (i%2 == 1 && i != 1) C(h, i) := C(h+1, (i-1)/2) + B(h, i)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
for Pi, 1 <= i <= n pardo
B(0, i) := A(i)
for h = 1 to log n
for i, 1 <= i <= n/2^h pardo
B(h, i) := B(h-1, 2i-1)+B(h-1, 2i)
for h = log n to 0
for i even, 1 <= i <= n/2^h pardo
C(h, i) := C(h+1, i/2)
for i = 1 pardo
C(h, 1) := B(h, 1)
for i odd, 3 <= i <= n/2^h pardo
C(h, i) := C(h+1, (i-1)/2)+B(h, i)
for Pi, 1 <= i <= n pardo
Output C(0, i)

T(n)=O(logn)W(n)=O(n)T(n)=O(\log n)\\ W(n)=O(n)

Merging

  • Merge two non-decreasing arrays A(1),A(2),,A(n)A(1), A(2), \cdots, A(n) and B(1),B(2),,B(m)B(1), B(2), \cdots, B(m) into another non-decreasing array $C(1), C(2), \cdots, C(n+m) $
  • Technique: Partitioning

Partitioning Paradigm

  • partitioning: partition the input into a large number, say p, of independent small jobs, so that the size of the largest small job is roughly n/p
  • actual work: do the small jobs concurrently, using a separate (possibly serial) algorithm for each
  • Merging \rarr Ranking
    • RANK(j,A)=iRANK( j, A) = i, if A(i)<B(j)<A(i+1)A(i) < B(j) < A(i + 1), for 1i<n1 \leq i < n
    • RANK(j,A)=0RANK( j, A) = 0, if B(j)<A(1)B(j) < A(1)
    • RANK(j,A)=nRANK( j, A) = n, if B(j)>A(n)B(j) > A(n)
  • The ranking problem, denoted RANK(A,B)RANK(A,B) is to compute:
    • RANK(i,B)RANK( i, B) for every 1in1 \leq i \leq n
    • RANK(i,A)RANK( i, A) for every 1in1 \leq i \leq n
  • Claim: Given a solution to the ranking problem, the merging problem can be solved in O(1)O(1) time and O(n+m)O(n+m) work
1
2
3
4
for Pi, 1 <= i <= n pardo
C(i + RANK(i, B)) := A(i)
for Pi, 1 <= i <= n pardo
C(i + RANK(i, A)) := B(i)
1
2
3
for Pi , 1 <= i <= n pardo
RANK(i, B) := BS(A(i), B)
RANK(i, A) := BS(B(i), A)

T(n)=O(logn)W(n)=O(nlogn)T(n)=O(\log n)\\ W(n)=O(n\log n)

Serial Ranking

1
2
3
4
5
6
i = j = 0; 
while (i <= n || j <= m)
{
if (A(i+1) < B(j+1)) RANK(++i, B) = j;
else RANK(++j, A) = i;
}

T(n)=W(n)=O(n+m)T(n)=W(n)=O(n+m)

Parallel Ranking

Stage 1: Partitioning

  • 每段划分为 logn\log n 大小,p=n/lognp=n/\log n
  • A_Select(i)=A(1+(i1)logn)A\_Select(i)=A(1+(i-1)\log n) for 1ip1\leq i\leq p
  • B_Select(i)=B(1+(i1)logn)B\_Select(i)=B(1+(i-1)\log n) for 1ip1\leq i\leq p
  • Compute RANK for each selected element

image-20210624221427219

Stage 2: Actual Ranking

  • At most 2p2p smaller sized O(logn)O(\log n) problems

T=O(logn)W=O(plogn)=O(n)T=O(\log n)\\ W=O(p\log n)=O(n)

Maximum Finding

Replace “+” by “max” in the summation algorithm

T(n)=O(logn)W(n)=O(n)T(n)=O(\log n)\\ W(n)=O(n)

Compare all pairs

1
2
3
4
5
6
7
for Pi, 1 <= i <= n pardo
B(i) := 0
for i and j, 1 <= i, j <= n pardo
if ((A(i) < A(j)) || ((A(i) = A(j)) && (i < j))) B(i) = 1
else B(j) = 1
for Pi, 1 <= i <= n pardo
if B(i) == 0 A(i) is a maximum in A
  • To resolve access conflicts: Concurrent-Read Concurrent-Write (CRCW)

T(n)=O(1)W(n)=O(n2)T(n)=O(1)\\ W(n)=O(n^2)

A Doubly-logarithmic Paradigm

  • Assume that h=loglognh = \log \log n is an integer

Partition by n\sqrt n

image-20210624223739351 $$ T(n)\leq T(\sqrt n)+c_1,W(n)\leq\sqrt nW(\sqrt n)+c_2n\\ \rarr T(n)=O(\log\log n),W(n)=O(n\log\log n) $$

Partition by h=loglognh=\log\log n

image-20210624223926467 $$ T(n)=O(h+\log\log(n/h))=O(\log\log n)\\ W(n)=O(h*(n/h)+(n/h)\log\log(n/h))=O(n) $$

Random Sampling

  • Parallel random sampling algorithm can run in O(1)O(1) time and O(N)O(N) work with very high probability, on an arbitrary CRCW PRAM
image-20210624224729520 image-20210624224748446
1
2
3
4
5
while (there is an element larger than M) {
for (each element larger than M)
Throw it into a random place in a new B(n^(7/8));
Compute a new M;
}

[Theorem] The algorithm finds the maximum among n elements. With very high probability it runs in O(1)O(1) time and O(n)O(n) work. The probability of not finishing within this time and work complexity is O(1/nc)O(1/n^c) for some positive constant cc