Microway Application Note 26

Why Virtual Processors

Virual Processor Description

The BSP Model

The PUB-BSP libraires: Introduction

Installing, Using and Running PUB

The DAPPLE libraries: Introduction

Installing, Using and Running DAPPLE

HPF and Virtual Processors

Why Virtual Processors

   In general parallel programming paradigms like message passing while providing good scalability suffer from the drawback of the communications bottleneck and the inability to exploit the available CPU, cache and memory resources fully. Thus the operating system (OS) is in these cases responsible for scheduling CPU cycles and allocating memory resources. In many cases this leads to wasted CPU and cache resources. This resulted in CPU manufacturers providing features of instruction level parallelism achieved by executing instructions at a very high clock rate to achieve high performance. A good example of this is the Intel netburst architecture. However there still is the large unexploited domain of task level parallelism that can be provided at the CPU stage to further improve performance, and exploit fully the wasted CPU clock cycles and related resources that are not efficiently scheduled by the OS particularly on SMP machines. With the implementation of hyperthreading in Intel Xeon Processors a huge window of oppurtunity has been provided for fully multi-threaded applications. The concept of hyperthreading is based on the implementation of virutal processors (logical processors) as compared to the physical processors (CPU'S). However one is not dependent on a hardware implementation of virtual processors and this can be done at a software level. This application note introduces the concept of virtual processors, and shows how one can implement and exploit it using the Bulk Synchronous Parallel (BSP) Model software libraries PUB-BSP, DAPPLE, and via load balancing HPF programs by migrating virtual processors.

Virtual Processor Description

   In creating an application to run using multiple processors, it is often the case that more processors are needed than those that are available. The virtual processor is a possible solution to this problem. The actual processors (physical processor) are shared by partitioning the local memory of the physical processors. Each virtual processor is mapped to a physical processor.  This processor time sharing of virtual processor actions can be compared to multiple users using one computer.These virtual processors are used as if they are physical processors.  Virtual processor must have their connections, communication links between processors, defined and parallel variables can be assigned to these "processors."  The number of virtual processors(V) divided by the number of physical processors(P) is called the virutal-processor ratio or VP ratio (V/P). Virtual processors can be used on a SIMD machine(connection machine). However, it is not always practical to use virtual processors. If the mapping of these virtual processors to physical processors is not efficient, then the time of communication and time sharing between the  physical processors negates the benefit of having the added "processors"(virtual processors).  A poor mapping of the virtual processors results when at least one physical processors has significantly more virtual processors mapped to itself than other processors. As an example consider the following fragment of code that can be used to demonstrate multi-level parallelism:

DO I = 1, NROWS
        Y(I) = 0
        DO J = 1, NCOLS
          Y(I) = Y(I) + A(I, J)*X(J)
        END DO
END DO

This is a simple example of parallelism being exploited at multiple levels that is given by the computation of the vector Y as the product of a matrix A and a vector X. In this example, parallelism is present at several levels. The iterations of the outer loop may execute independently and concurrently, on a single processor or on multiple processors. On each processor, a number of independent instruction streams, each with its own physical register set and program counter, may work on distinct outer loop iterations. It is the parallelism offered by many independent computations that allows processors to remain utilized without data caching or fast local memories. Within a single stream, several memory references may be simultaneously outstanding while other instructions from the same stream are executed. As on any processor, the time between issuing a load instruction and its completion depends on the latency imposed by the memory, the network, and other hardware. In the example above, during the time required to load A(I, J), processors can issue additional instructions that are independent of the particular pending multiply. For example, the processor may issue an instruction loading A(I, J + 1)or an instruction multiplying some previously loaded element of A. At every clock cycle, each processor chooses and begins execution of a ready instruction, long before the previously issued instruction has completed. At a higher level, other portions of the same program may execute on streams on the same set of processors. Even portions of other users' programs may occupy streams on this set of processors. In summary, there are many potential sources of parallelism, each representing an opportunity to utilize processors more fully and to improve performance. This is essentially the basis for the use, advantages and need of virtual processors. What I wish to emphasize here is another important concept. "Parallelism not placement." Sustaining high performance on any parallel machine requires high processor utilization. In turn, high utilization requires that data is immediately available for each operation and that parallelism is continually present in the computation being performed. When parallelism wanes, processors idle and become unproductive. Most computer architectures provide features to speed access to a portion of the data. For example, many computers copy referenced data to fast cache memory close to the processor. In this case, data locality is exploited to improve performance. Exploitation of fast memory through programming is known as data placement. In many cases, data placement complicates programming and performance tuning. In the example above of computing the vector Y as the product of a matrix A and a vector X, the discussion did not consider data placement, only the available parallelism. Virtual processor performance is unaffected by data placement because memory is uniformly accessible and there are no data caches. Let us reconsider the example, assuming placement is important. 

DO I = 1, NROWS
          Y(I) = 0
        DO J = 1, NCOLS
          Y(I) = Y(I) + A(I, J)*X(J)
        END DO
END DO

Because I is invariant in the inner loop while J advances, Y(I) may be stored in a register, requiring only two loads from memory per multiplication. When considering data placement, however, there may be a problem: the computation accesses A(I, J) values in row-major order (that is, A(I, 1), A(I, 2), and so forth). On machines with multiple words of data per cache line, it is much faster to access data in the order it is physically stored in memory than to access arbitrary locations. If, on such cache-dependent machines, A is stored in row-major order, reordering the loops in the above example makes the computation more efficient.

DO J = 1, NCOLS
        DO I = 1, NROWS
          Y(I) = Y(I) + A(I, J)*X(J)
        END DO
END DO

This example requires three memory operations, instead of two, for each multiplication because Y(I) must be loaded from and stored to memory for each iteration. Even if A were stored in row-major order, we face the same problem when multiplying X by A -transpose unless A -transpose was computed prior to multiplication. Another way to improve performance on a cache-dependent machine is to divide A into blocks such that each block and the associated portions of X and Y fit into fast cache memory. Then, multiply the blocks one at a time to exploit the fast memory. On distributed-memory machines, blocking is required so that each processor can store portions of the matrix in its local memory. Otherwise, the entire matrix must be either copied or communicated during the computation. These types of considerations are directly related to data placement. Machines that require data placement specific to their architecture to achieve high processor utilization force the use of algorithms and implementation styles that exploit those particular features of that architecture. This constrains the design space and complicates the code, especially when porting the code across multiple platforms. On virtual processor machines, the first code example is preferable to the second because it requires fewer memory operations. However, many compilers produce the same machine code independent of which version is used by the programmer, and further rewrite it to achieve even fewer memory operations per multiplication. We can use the same example code to multiply the transpose of A and a vector by replacing A(I, J) by A(J,I). In this case, the same speed is achieved. More generally, the order in which data is accessed is irrelevant. Each word is accessed independently of each other. Blocking is not needed because there are no data caches or local memories. Thus unlike MPI and other message passing paradigms there is not need for the standard send / receive pairs. This send / receive pairwise operations on the host and destination processors is very error prone because deadlocks can be easily introduced if a message is never accepted. Further it is difficult to determine the correctness and complexity of programs implemented using pairwise sends and receives.

   Thus the virtual processor computational model is thread-centric, not processor-centric. A thread is a software entity comprised of a sequence of instructions that are nominally issued in order, respecting jumps and skips. A thread has a finite state, which is defined at any given point by the values in its registers and program counter. Following are the characteristics of the set of threads that execute a single user program.

The implementation of the virtual processor paradigm is done successfully using the BSP model, an extension of the traditional PRAM (Parallel Random Access Machine) Model. PRAM is based on an idealized parallel model of computation with a unit-time communication diameter. The PRAM is a shared memory machine that consists of a set of identical processors, where all processors have unit-time access to any memory location. The appeal of a PRAM is that one can ignore issues of communication when designing algorithms, focusing instead on obtaining the maximum parllelism possible in order to minimize the running time necessary to solve a given problem. The PRAM model typically assumes a SIMD strategy, so that operations are performed synchronously. If multiple processors try to simultaneosly read or write from the same memory location, then a memory conflict occurs. There are several variations of the PRAM model targeted at handling these conflicts, ranging from the Exclusive Read Exclusive Write (EREW) model, which prohibits all such conflicts, to Concurrent Read Concurrent Write (CRCW) models, which have various ways of resolving the effects of simultaneous writes. Another model is an intermediate one called Concurrent Read Exclusive Write (CREW) PRAM. The BSP Model is extension of the PRAM in which communication occurs and a communication cost is also considered between the processors during synchronization and where each processor works using local data. BSP can be implemented on packet switching networks as well as on optical crossbar designs. We shall now consider the BSP model in detail.

The BSP Model

   A BSP computer consists of the following:

There are no specialised combining, replication or broadcasting facilities. Thus a BSP computer consists of a set of processor/memory modules, a routing network, and a mechanism for synchronizing the processors in a barrier style. The performance of a BSP computer depends on three parameters:

Obviously the number of processors and computational speed of each are key parameters. If we define a step to be the basic unit of calculation (often taken to be a single floating point operation), then we can denote the speed as s steps/sec. We can also see that the capacity and speed of the communications network is a vital element. For ease of comparison between systems, we always measure the performance of the communications network in units of the computing speed. The cost of carrying out a barrier synchronisation of the processors, for example, can be measured in terms of the number of steps that could have been performed in the time taken to synchronise. This let's us contrast a system with fast synchronisation, in which relatively few steps could have been executed during the time it takes to synchronise, with one which has much worse performance relative to its computational powers. In general we can expect better overall performance from a system with low values of this parameter. Similarly when we estimate the communications throughput of the network linking the processors, we look at the cost in steps for each word of data transmitted. This gives the ratio of the computing power to the communication power of the system. The lower this figure is, the better the balance between compute power and communications power, the easier it is to get scalable performance. We therefore arrive at the following four parameters:

Because the speed is in effect a normalising factor in these parameters there are only three independent parameters, p, l and g. Note that all are based on the bulk properties of the system. The values are determined by actual measurement using suitable benchmarks that mimic average computation and communication loads. The speed s is measured as the actual rate at which useful calculation is done; it is NOT the peak performance figure quoted in the manufacturer's data sheet. The value of g is calculated from the average cost of transferring each word of messages of all sizes in the presence of other traffic on the network. It is NOT based on the manufacturer's claimed bi-section bandwidth. It is not measured from single point-to-point transfers but measures the sustainable speed that will be experienced by real application code. The g value can be approximated by calculating (total number of local operations by all processors per second)/(number of words delivered by the communications system per second). g enables you to estimate the time taken to exchange data between processors. If the maximum number of words arriving at any one processor during a single such exchange is h, then we estimate that up to gh steps could have been executed during the exchange. For a network of workstations, with say processors capable of 2000 Megaflops (sustained) might have g values in the range of a few thosands FlOps per word transferred, and l values of the order of 100,000 to 1,000,000 FlOps. The closer the g value approaches 1 and the smaller the l value is for a given system, the easier it is to produce scalable parallel performance on that architecture. Increasing the number of processors without increasing the total computational workload will inevitably spread the calculation load more and more thinly on each processor. With fewer and fewer calculations per node before there is a need to communicate, the relative cost of communications, both latency and transfer costs, begin to dominate. High values of g and l mean that this limit sets in early. With g and l equal to 1, the cost of accessing remote data is approximately the same as accessing local data and the calculation scales to the limit. 

   The essence of the BSP approach to parallel programming is the notion of the superstep, in which communication and synchronisation are completely decoupled. A BSP program is simply one which proceeds in phases, with the necessary global communications taking place between the phases. This approach to parallel programming is applicable to all kinds of parallel architecture: distributed memory architectures, shared memory multiprocessors, and networks of workstations. It provides a consistent, and very general, framework within which to develop portable software for scalable computing. Thus a BSP computation consists of a sequence of parallel supersteps, where each superstep is a sequence of steps carried out on local data, followed by a barrier synchronisation at which point any non-local data accesses take effect. Requests for non-local data, or to update non-local data locations, can be made during a superstep but are not guaranteed to have completed until the synchronisation at superstep end. Such requests are non-blocking; they do not hold up computation. This approach is less prone to deadlocks since there are no explicit receives. Instead the barrier synchronization signifies the end of communication operations. Moreover, BSP provides a simple cost model that makes it relatively easy to determine the cost of executing a parallel program. Figure 1 below shows a typical sequence of steps on a BSP type computation. As seen in Figure 1 there is clearly no limit on the number of virtual (logical) processors per given physical processor, as this division (P1 P2 ... Pn) can be implemented at a software level by emulating physical processors and dividing resources using multi-threading, and enforcing cache coherency at a logical processor level. This will be disccused in more detail shortly in three BSP implementations: the PUB-BSP one (using C), the DAPPLE one (using C++), and via load balancing through migration of virtual processors using HPF.

 Figure 1: Typical sequence of steps in a BSP type of computation.

   The programmer's view of a BSP computer is that it has a large, globally accessible, memory. The division between data held locally and data held on remote processors is consistent with modern non-uniform memory architecture (NUMA) systems which often exhibit a two level memory access time characteristic. This includes virtual shared memory machines with all types of coherent or non-coherent cache structures. This type of architecture is expected to dominate because it is perceived to be simpler to programme. To achieve scalability it will be necessary to organise the calculation in such a way as to obviate the bad effects of large values of l and g, i.e.of the large latencies in the communications network. In some cases it will be necessary to select different algorithms for different ranges of l and g values. By separating the computation on local data from the business of transferring shared data, which will be handled by lower level software, we can ensure that the same computational code will be able to run on different hardware architectures from networked workstations to genuinely shared memory systems. The superstep structure of BSP programs lends itself to optimisation of the data transfers. All transfers in a superstep between a given pair of processors can be consolidated to form larger messages that can be sent with lower (latency) overheads and so as to avoid network contention. The lower level communications software should also exploit the most efficient communication mechanisms available on the actual hardware. Since this software is application-independent, the cost of achieving the efficiency can be spread over many applications and is an acceptable one. Another advantage of the simple structure of BSP programs is that the modelling of their performance is much easier than for message passing systems, for example. In place of the random pair-wise synchronisation that characterises message passing, the superstep structure in BSP programs makes it relatively easy to derive cost models (i.e. formulae that give estimates for the total number of steps needed to carry out a parallel calculation, including allowance for the communications involved). The profiling information is easy to read and interpret even from a text file and debugging of BSP programs is relatively straightforward. The other major advantage is that the number of virtual processors available in the entire cluster can be changed dynamically as the program executes at literally every super step. This makes possible the concept of full utilization of resources with a slight overhead of detecting every processors load. The length between super-steps can also be adjusted making it easier to allocate resources in rapid load changing scenarios. Finally checkpointing and roll back is easy in case of failures as backtracking to a previously saved state (super step) is relatively straightforward. In essence the implementation of the BSP model as we shall see offers a thread-safe programming environment, and this has resulted in the hyper threading model implementation in Intel Xeon processors. This is considered in more detail in Microway Application Note 27.

The PUB-BSP libraires: Introduction

   The Paderborn University BSP (PUB) library is based on the BSP model but has a number of additional features that make it useful. In addition to providing basic BSP functionality (point-to-point communication and barrier synchronization), it provides a rich set of collective communication operations like broadcast, reduce and scan. These operations are used very frequently and they often determine the overall efficiency of a parallel application. It is also possible to dynamically partition the processors into independent subsets. After a partition operation, each subsystem acts as an autonomous BSP computer and a barrier synchronization involves only the processors in the subsystem. Hence the PUB library supports nested parallelism and subset synchronization. This feature is especially useful when completely different subalgorithms have to be performed independently in parallel. Without this possibilty algorithms have to be interleaved, and this creates additional complexities as in most cases the separate algorithms need separate synchronizations. Another feature supported are BSP objects. These objects serve three purposes:

Figure 2 gives the various layers of PUB. The machine (platform) dependent part are important and are important points to consider for porting applications. The SYSTEM layer provides a few operating system dependent functions like getting system time or host name. The THREADS layer provides the multi-threading support. All functions of the PUB-library are thread-safe, but one BSP object and all its subgroups must not be used by more than one thread. If you want to use BSP objects in different threads, create them with bsp_threadsave_dup. If the underlying operating system supports threads and message passing simultaneously (all except IBM SP/2, CRAY T3E) then a subset of POSIX threads and counting semaphores including mutexes are provided as functions and these are called by including the threads.h header provided rather than the OS thread libraries. This enables one to create true architecture independent programs. The COMM layer is a small set of efficient communication functions. Besides functions for initialization and termination and some inquiry operations, COMM also implements a set of efficient communication operations as message passing functions for sending, sending asynchronously, receiving and probing. This implementation can be done using MPI, TCP or the SHMEM library. For all examples described here we use the TCP implementation. This is the fastest and most reliable for a network of workstations (NOW's) using the native OS TCP implementation. Besides the three platform dependent layers two other features of PUB are extremely important:

Figure 2: The different layers in PUB.

   PUB supports two types of sychronization operations: standard synchronization via bsp_sync(bsp) and oblivious synchronization via bsp_oblsync(bsp,nmsgs). The latter is used only when the number of messages each processor will receive in a superstep has been computed beforehand. For example if each processor sends a message to its right neighbour then every processor knows that it will receive exactly one message. Thus in the latter case each processor waits unitl nmsgs are received. This is very fast with no additional communication required. Between two supersteps the type of synchronization can alternate, but in each superstep all processors have to use the same type of synchronization. We will now consider the installation and implementation of the PUB library in greater detail.

Installing, Using and Running PUB

   The current version of PUB is 7 and is in a tar.gz file format pub-7.0b.tar.gz. The profiler is also in tar.gz file format   v1.4_bsplib_toolset.tar.gz. Unzip and untar to a directory of your choice. For the profiler you will need QT version 1.44 and KDE version 1.1 libraries. For the installation of PUB PERL and the standard gnu c compiler will suffice. However the cluster must have rsh set up. The installation must be identical in all the computers in the cluster, both in terms of the options passed to the configure script and the directory structure. To start the installation issue ./configure. In this case the TCP library with generation of PROFILE information is used. The other options include MPI and DEBUG modes. Figure 3 gives the configure options used.

Figure 3: Compilation sequence for PUB libraries and executibles using TCP and PROFILE mode.

After successful compilation pubcc and pubCC in the ~/pub/bin directory can be used to compile programs or the makefile in ~/pub/examples can also be used. Typing "make all" in the examples directory will compile all the programs there. Typing make [program name] will compile a single program. This is shown for the testbsp program in Figure 4.

Figure 4: Using the supplied Makefile to compile PUB programs.

Next a machines file of the same name needs to be created in the directory from where the programs will be run. Otherwise the -m option with the machines file has to be used when pubd (the pub daemon) is started on the nodes. The machines file has the node names of all the nodes on which the pub daemon is started. In this case the cluster has four dual processor nodes master, node2, node3, node4. The path for the pub daemon must also be set. A typical start of the pub daemon on all the nodes is shown in Figure 5.

Figure 5: Starting pubd on all the nodes. Use the -m option with the machines file name if a default machines file called machines with node names (one name per line) is not provided in the directory from which pubd is started.

The command run [cmd] at the PUB prompt will run the command on all the nodes as shown in Figure 5. The program hello is not provided in the examples directory and is listed here, in Figure 6. The command kill removes all processes from all the nodes. To quit use the command quit, this will stop all the pubd daemons on all the nodes.

Figure 6: A simple hello program to test the initialization of PUB-BSP.

As shown in Figure 6 bsplib_saveargs sends the program args for starting the processes on all the nodes. bsplib_init initializes PUB-BSP and gets the main BSP object bsp. bsp_nprocs returns the number of processors and bsp_pid returns the own processor id.  The processor id is in the range 0,...,nprocs-1. PUB is stopped by a call to bsplib_done. The profile file for this run is shown in Figure 7.

Figure 7: Profile file showing the supersteps in the hello run. The first column is the time step.

In order to describe the PUB library routines fully consider a simple parallel binary search with partition operations as an example. The class void bin_search(pbsp bsp, int d, int m) shown in Figure 8 implements this. It is assumed here that one is familiar with the standard parallel search algorithm, where one has to locate M = m * p queries in a search structure of N = n * p keys. The search butterfly that we use has the root of the search tree replicated p times and every  other level is replicated half as many times as the level above. The outputs of the butterfly are the roots of balanced binary search trees of size n. To locate the queries, they have first to be routed to the correct subtree. This is done by sending them to the left or right half at every level of the butterfly depending on whether the query is smaller than the current node key or not. A processor P belongs to the left (right) half of the butterfly in dimension d, if the d-th least significant bit in its binary representation is 0 (1). The first part of the code in Figure 8 just routes the query to its correst subtree. Here inLeft(d,me) and inRight(d,me) are macro's that return 1 if processor me belongs to the left or right half of the butterfly in dimension d, respectively (and can be replaced by tests me<nprocs/2 and me>=nprocs/2 with the ID of the processor opposite to processor me in its group given by (me+nprocs/2)%nprocs), and Opposite(d,me) is a macro that returns the ID of the processor opposite to processor me in dimension d. The call bsp_send sends a message of "size" bytes to processor "dest". The first argument bsp is a pointer to an object of type tbsp, which must be passed to most PUB functions. The call bsp_sync performs the barrier synchronization. When it returns all messages are available in a message queue at their destinations. The size of this queue can be determined by calling bsp_nmsgs, and messages can be fetched by calling bsp_getmsg(bsp,i). This function returns a message handle msg to the i-th message in the queue, but does not remove it from the queue (it exists until the next synchronization point). This random access to the message queue allows for greater flexibility and now messages do not have to be explicitly copied. The message handle msg can be used to obtain the data part of the message (by calling bspmsg_data(msg)), as well as to determine its size (bsp_msg(msg)) and the ID of the source (bspmsg_src(msg)). As these functions operate on messages no BSP object has to be supplied. In order to prevent fine grained parallelism bulk transfers are introduced. Each processor has p output buffers. Small packets are not sent immediately, but instead the packet payload and some header information is written into the correct buffer. Messages larger than the buffer are sent instantaneously. If the buffer is full, the data contained in the buffer is sent to its destination. In this way packet combining can be performed, whilst still overlapping communications with local computations. Thus in order to pack data the usual bsp_send is replaced by bsp_hpsend. This function is unbufferd, hence the data should not be changed until the next synchronization. This reduces data copying. However because data still has to be bundled, it improves performance only slightly. The relative benefits are larger when the message payload is already stored contiguously. Since the binary search algorithm employs a divide and conquer strategy paritioning the processors into several independent subsets with subset synchronization offers many advantages. These are:

The call to bsp_partition (Figure 8) partitions the processors into two groups of equal size. In general bsp_partition(bsp,subbsp,n,part) partitions the processors into n groups consisting of the processors [0,part[0]-1],[part[1]-1],etc., and returns a pointer subbsp to a new BSP object of type tbsp. The processors rejoin by calling bsp_done. Because the processors are renumbered after a partition operation (from 0 to the group size minus one), the code routing each query to its subtree is also simplified. The enquiry function bsp_pid(bsp) returns the ID of the calling processor in the group associated with BSP object bsp, and bsp_nprocs(bsp) returns the number of processors in the group.

void bin_search(pbsp bsp, int d, int m){

me=bsp_pid(bsp); nprocs=bsp-nprocs(bsp);

for(i=new_m=other=0; i < m; i++)

   if (query[i]<=gkey(d] && me>=nprocs/2 || query[i]>gkey[d] && me<nprocs/2)

        temp[other++] = query[i];

   else

        query[new_m++] = query[i];

bsp_hpsend(bsp, (Tne+nprocs 2) % nprocs, temp, other*sizeof(int));

bsp_sync(bsp);

  msg = bsp_getmsg(bsp, 0);

memcpy(&query[new_m], bspmsg_data(msg), bspmsg_size(msg));

new_m += bspmsg_size(msg) / sizeof(int);

if (!d) local_search(new_m,query,n,key);

else {

     part[0] = nprocs/2;

     part[l]= nprocs;

     bsp_partition(bsp, subbsp, 2, part);

     bin_search(subbsp, d-1, new_m);

     bsp_done(subbsp);

     }

}

Figure 8: Binary Search class function for implementing a parallel binary search using PUB-BSP.

Note instead of bsp_sync the function bsp_oblsync(bsp,n) can be used, since in this case the number of messages each processor is to receive is known exactly. Another point to note is that in contrast to the way global communication is implemented in most message-passing libraries, in PUB they are non-synchronizing. This means that the results are not available until the next synchronization barrier. It has the advantage however that one can intiate several broadcast or parallel prefix operations in the same superstep. Thus PUB collective communication operations can be performed on any arbitrary subset of processors, so unlike MPI all the processors do not need to participate. For example the function bsp_gsend(bsp,first,last,buf,size) broadcasts the data of length size starting at buf to the processors first,...,last. If the processors are not consecutively numbered then bsp_lsend(bsp,table,n,buf,size) needs to be used. This broadcasts the data pointed to by buf to the processors table[0],...,table[n-1].

Figure 9 gives the source code for creating virtual processors. Besides being used for testing a program with different number of nodes, and debugging a parallel program on a single processor machine, as well as fully using CPU resource while it is waiting say for memory access one can use it for writing efficient external memory programs with virtual memory.

Figure 9: Source code for the program simulator that creates vp number of virtual processors.

Figure 10 gives a sample run for this with vp=2. Hence each processor now has two virtual processors identical to the case of hyperthreading which is discussed in more detail in Microway Application Note 27.

Figure 10: Running simulator with vp=2 creating 2 virtual procesors per CPU in the 8 CPU cluster.

The pubd was started with 8 processors and each processor now has two virtual processors giving a total of 16 processors.

The DAPPLE libraries: Introduction

   DAPPLE or DAta Parallel Programming Language for Education is a set of C++ class library designed to provide the illusion of a data-parallel programming language on conventional hardware and with conventional compilers. DAPPLE defines Vectors and Matrices as basic classes, with all the C operators overloaded to provide for elementwise arithmetic. In addition, DAPPLE provides typical data-parallel operations such as scans, permutations, and reductions. Finally, DAPPLE provides a parallel if-then-else statement to restrict the context of the above operations to subsets of vectors or matrices. Dapple is however restricted unlike PUB to work only on one machine as it does not support comminications but only separate memory allocations for data parallel operations. However by using the concept of virtual processors which DAPPLE supports it is possible to exploit fully the SMP environment of muli-processor CPU computers. DAPPLE provides a classic example (like PUB) of implementing virtual processors using software, unlike the case of hyperthreading in Xeon processors that creates virtual processors using the BIOS and hardware. However the number of virtual processors is created using DAPPLE is large. Thus while DAPPLE collections (vectors and matrices) can be thought of as a set of data elements, it can also be interpreted as a set of data elements each of which has a virtual processor for manipulating that element. The actual memory allocations and the parallel data operations occur transparently, and cannot be set by the user. The size of the problem that a given machine can handle for satisfactory speed-up will eventually depend on the available memory and CPU resources.

Installing, Using and Running DAPPLE

   To install DAPLLE download the file dapple.tar.Z and untar and unzip it. You will need g++ version 2.6.3 or higher to compile the libraries. This is acheived by typing make depend followed by make in the top level dapple directory. The library file (dapple.a) and header (dapple.h) will be in the ~/dapple/lib directory. The examples directory can be used for running the programs. Typing make and the file name in the examples directory can be used to compile *.cc program files that use dapple. Figure 11 gives a typical make dialog for a simple program that computes pi. The program listing is given in Figure 12.

Figure 11: Example of making a program that uses DAPPLE.

Figure 12: Source code for the DAPPLE program that computes pi.

   DAPPLE provides two new classes of data: Vector and Matrix. You can define one- and two-dimensional data sets, and manipulate them with many of the usual C++ operators, in parallel. In addition, you can compute some functions of the entire vector or matrix, such as summing all the elements. Finally, you can move elements around, perhaps to sort them. A vector is simply a one-dimensional collection of elements. Each element is a scalar value, and can be an integer, character, boolean, or single- or double-precision floating-point number. For example, a vector of N integers called "data" can be defined with intVector data(N). The elements in this vector are not initialized. Later, it should be initialized in an assignment statement. To initialize it when it is defined, so that all of the elements have the same value, you can include a second argument:  intVector data(N, 3). Here, all N elements of vector "data" have the value 3. If you have an array that already contains some interesting data, you can initialize the vector from the contents of the array:

       const int N = 4;

       int array[N] = {45, 23, -3, 9};

       intVector data(N, array);

A final possibility is to initialize the elements to some function of their position in the vector. (The first element is at position 0, the next at position 1, and the last at position N-1.)

      int somefunc(const int position);

      intVector data(N, somefunc);

Here, the first element of the new vector will be initialized to somefunc(0), and the last element will be initialized to somefunc(N-1).The predefined function called Identity() is useful for this purpose: Identity(i) returns i, so intVector data(N, Identity) initializes data to the set of values 0, 1, 2, ..., N-1.

There are other vector types, used the same way:

     intVector A(N);

     charVector B(N);

     floatVector C(N);

     doubleVector D(N);

     booleanVector E(N);

Matrices are used much like vectors. For example, the following all define matrices with N rows and M columns:

     intMatrix A1(N,M, IdentityR);

     intMatrix A2(N,M, IdentityC);

     charMatrix B(N,M, 'x');

     floatMatrix C(N,M);

     doubleMatrix D(N,M, data);

     booleanMatrix E(N,M, false);

Integer matrix A1 is initialized so that every element's value is its row number. IdentityR(r,c) is a predefined function thatreturns r. Integer matrix A2 is initialized so that every element's value is its column number. IdentityC(r,c) is a predefined function thatreturns c. Thus, for matrices, the function initializer takes two parameters (the row and column) instead of one. All elements in character matrix B are initialized to the character 'x'. Float matrix C is uninitialized. Double matrix D is initialized from a two-dimensional array called data, which should have the dimensions [N][M]. Finally, all elements in boolean matrix E are initialized to false.

Once you have vectors or matrices defined, you can manipulate them, in parallel, much like their scalar equivalents. For example:

     intVector A(N), B(N), C(N);

     ...

     A = B + C;

What this means is that corresponding elements (which are scalars) of the vectors B and C are added, and then assigned to the corresponding element of vector A.

It is similar to the C code

     int A[N], B[N], C[N];

      ...

     for (i = 0; i < N; i++) A[i] = B[i] + C[i];

There are many elementwise operators: Note not all operators apply to all types of vectors or matrices.

One additional elementwise operation is to apply a function to every element of a vector or matrix. The function, and any functions it calls, must be pure, that is, it must be free of side effects like writing global variables or doing I/O. For example:

     extern double sqrt(const double);

     doubleMatrix A(N,M), B(N,M);

     ...

     A = apply(sqrt, B);

Each element of B is passed to sqrt(), and the result stored in the corresponding element of A.   

Sometimes it is convenient to operate on a single row, column, or element of a matrix, or a single element of a vector. (Of course, in data-parallel programming we wish to avoid this practice as much as possible, and operate on the entire vector or matrix, but sometimes it is necessary). You can refer to single elements ("piece") by subscripting:

     intVector A(N);

     intMatrix B(N,N);

     ...

     A[3] = 5;

    B[1][3] = A[4];

You can also refer to a particular row or column of a matrix (a "slice") with a special subscripting form:

     intVector A(N);

     intMatrix B(N,N);

     ...

     A = B[2][_]; // assign row 2 of matrix B to vector A

     B[_][3] = A; // assign vector A to column 3 of matrix B

Notice the underscore (_) that is used as a placeholder, to indicate "all of the indices in this dimension." Slices can be used anywhere a vector can be used. The first use above is a row slice; the second is a column slice. Row slices need not use a placeholder, that is, B[3] means the same thing as B[3][_] (it is better style to include the [_], however).

The multiplication operator (*), like all the other operators, is an elementwise operator. When manipulating mathematical vectors, however, there are two kinds of vector product: inner product (also called a dot product) and outer product (also called a cross product). The inner product produces a scalar, and the outer product produces a matrix. DAPPLE has methods for both:

     intVector A(N), B(N), C(N), D(M);

     intMatrix E(N,M);

     ...

     C = inner(A, B); // inner product of A and B

     C = sum(A * B); // inner product of A and B

     ...

     E = outer(C, D); // outer product of C and D

It is often convenient to define

const intVector VP(N, Identity); const intMatrix VProw(N, M, IdentityR); const intMatrix VPcol(N, M, IdentityC);

for use in expressions that depend on the position, e.g.,

intVector X(N), Y(N); ... Y = VP * X; // compute i * x[i], for i = 0 to N-1

Sometimes you wish to perform an operation on only a subset of the elements of a vector or matrix. In the following example, we wish to compute the absolute value of a vector A:

ifp (A < 0) B = -A; else B = A;

The ifp() statement works much like an if() statement in C, except that it works in parallel, and the condition is a vector or matrix condition. In the example above, the expression A < 0 is a vector condition. In parallel, every element of A is compared with 0, resulting in a boolean value (true or false). (Thus, the result of such a comparison is a booleanVector). The subset of elements (virtual processors) where the expression was true executes the then clause. The subset of elements (virtual processors) where the expression was false executes the else clause, if any. In this example, every element of B will be assigned some value, either -A or A, as appropriate. Note that both clauses are always executed; it's just that a different set of virtual processors are active in each clause.

Here's another example:

ifp (A != 0) B /= A; else B = 0;

ifp() may be nested. Each ifp reduces the size of the set of virtual processors that are active. Thus, in

ifp (A == B) C = 0; // A == B here else ifp (A < B) // A != B here C = -1; // A < B here else C = 1; // A > B here

The second ifp() is part of the else clause of the first ifp, so it is only executed by those virtual processors who are active there, i.e., those who found their elements A != B.

Within one of the clauses of an ifp() statement, as we said, some subset of the virtual processors are active. Some subset of which set of virtual processors? Of those that correspond to the elements in the vector or matrix condition provided to ifp(). Any other vectors or matrices used within the clause must have the same shape , i.e., be a vector of the same size, or a matrix with the same number of rows and columns, as that in the condition expression. (Otherwise, what subset of virtual processors would be active in that vector or matrix?) Thus, in

intVector A(10), B(10), C(20); ... ifp (A > 0) B = A; // ok else C = A; // error: C has a different shape

It is often convenient to define

const intVector VP(N, Identity); const intMatrix VProw(N, M, IdentityR); const intMatrix VPcol(N, M, IdentityC);

for use in situations that depend on the position, e.g.,

intVector X(N); ... ifp (VP % 2 == 0) // here we can work with X[i] where i is even else // here we can work with X[i] where i is odd

A subtle point about slices and context: most of the time, slices act like a vector, but they can also act like a matrix. For example, in

intVector A(N); intMatrix B(N,N); const intVector VP(N, Identity); const intMatrix VProw(N, N, IdentityR); const intMatrix VPcol(N, N, IdentityC); ... ifp (VP % 2 == 0) B[2][_] = 33; // assign 33 to row 2 of matrix B ifp (VPcol % 2 == 0) B[2][_] = 33; // assign 33 to row 2 of matrix B

the context in the first ifp is that of an N-element vector, and in the second ifp is that of an NxN-element matrix. Both have the same effect, however. In other words, when deciding whether the virtual processors of a slice are active, if the context is a vector, the slice is thought of like a vector, and if the context is a matrix, the slice is thought of like a matrix (using the appropriate row or column of the context to determine the activeness of elements in the slice).

A reduction takes a vector or matrix and produces a scalar value; for example, a sum reduction computes the sum of all values in the vector or matrix.

DAPPLE includes many different kinds of reductions, for both vectors and matrices:

intVector A(N); int x; ... x = sum(A); // x = A[0] + A[1] + ... + A[N-1] x = any(A); // x = true iff ANY of A[i] != 0 for i in 0..N-1 x = all(A); // x = true iff ALL of A[i] != 0 for i in 0..N-1 x = n_nonzeros(A); // x = number of nonzero elements in A x = n_active(A); // x = number of active elements in A x = max_value(A); // x = max of A[i] for i in 0..N-1 x = min_value(A); // x = min of A[i] for i in 0..N-1

Only the active elements contribute to the reduction. sum(), max_value(), and min_value() do not apply to boolean vectors or matrices.

DAPPLE includes some reductions only for vectors:

intVector A(N); int x; ... x = first(A); // x = A[first_index(A)] x = first_index(A); // x = i if i is the smallest-numbered active virtual processor x = max_index(A); // x = i if A[i] is max of A[j] for j in 0..N-1 x = min_index(A); // x = i if A[i] is min of A[j] for j in 0..N-1

Again, only the active elements contribute to the reduction. max_index(), and min_index() do not apply to boolean vectors.

A scan is an operation that takes a vector and produces a vector. Each element in the result represents the combination of all the elements preceding the corresponding element in the original vector. For example, a sum scan (prefix sum) puts the sum of elements 0..i-1 in element i of the result (element 0 of the result gets a 0, since it is the sum of no elements).

DAPPLE includes many different kinds of scans:

intVector A(N), B(N); ... A = plus_scan(B); // A[i] = B[0] + B[1] + ... + B[i-1] A = max_scan(B); // A[i] = max of B[j], for j in 0..i-1 A = min_scan(B); // A[i] = min of B[j], for j in 0..i-1 ... booleanVector C(N), D(N); ... C = or_scan(D); // C[i] = D[0] || D[1] || ... || D[i-1] C = and_scan(D); // C[i] = D[0] && D[1] && ... && D[i-1]

Note that the arithmetic scans can be applied to anything but boolean vectors, while OR and AND scans apply only to boolean vectors.

Note that the value of the first element is independent of the contents of the vector. For plus_scan, the value is 0. For max_scan, the value is the smallest which that type can hold (for integers, -2147483647). For min_scan, the value is the largest which that type can hold (for integers, 2147483647). For or_scan, the value is false , and for and_scan, the value is true.

If some of the virtual processors are inactive, they do not participate in the scan, that is, no value is produced at those elements, and their value is not included in the scan.

Scans can be applied to Matrix objects as well. In this case, each row or column is independently scanned like a vector. The same scans are available, both for rows and for columns.

intMatrix A(N), B(N); ... A = plus_scan_rows(B); // A[i][j] = B[i][0] + B[i][1] + ... + B[i][j-1] A = plus_scan_cols(B); // A[i][j] = B[0][j] + B[1][j] + ... + B[i-1][j] A = max_scan_rows(B); // A[i][j] = max of B[i][k], for k in 0..j-1 A = max_scan_cols(B); // A[i][j] = max of B[k][j], for k in 0..i-1 A = min_scan_rows(B); // A[i][j] = min of B[i][k], for k in 0..j-1 A = min_scan_cols(B); // A[i][j] = min of B[k][j], for k in 0..i-1 ... booleanMatrix C(N), D(N); ... C = or_scan_rows(D); // C[i][j] = D[i][0] || D[i][1] || ... || D[i][j-1] C = or_scan_cols(D); // C[i][j] = D[0][j] || D[1][j] || ... || D[i-1][j] C = and_scan_rows(D); // C[i][j] = D[i][0] && D[i][1] && ... && D[i][j-1] C = and_scan_cols(D); // C[i][j] = D[0][j] && D[1][j] && ... && D[i-1][j]

Again, if some of the virtual processors are inactive, they do not participate in the scan, that is, no value is produced at those elements, and their value is not included in the scan.

DAPPLE has many ways of rearranging the data in a vector or matrix, or more correctly, producing a new vector or matrix with the rearranged data of another matrix.

The elements of a vector may be shifted left (toward smaller-numbered positions) or right (toward higher-numbered positions). All elements participate, regardless of the active set.  As elements are shifted to the right, the left side of the vector fills with zeros. Similarly, as elements are shifted left, the right side fills with zeros. For example,

intVector A(N), B(N); ... A = shift(B, 3); // shift right: A[i+3] = B[i], except A[0..2] = 0 A = shift(B, -2); // shift left: A[i-2] = B[i], except A[N-2..N-1] = 0

Another form of shifting is pack : all active, non-zero elements are pushed to the left, and zero elements pushed right, so that all the active non-zeros are contiguous in the vector, in the same relative order as they began.

intVector A(N), B(N); ... A = pack(B);

A Matrix shift is similar, except that two offsets are specified: one for rows, and one for columns.

floatMatrix A(N), B(N); ... A = shift(B, 3, 4); // shift each element 3 down and 4 right

Sometimes it is convenient to shift each row (or column) of a matrix a different amount.

floatMatrix A(N,N), B(N,N); intVector S(N); // an amount to shift ... A = shift_rows(B, S); // shift each row i by S[i] A = shift_cols(B, S); // shift each column i by S[i]

In the above S could instead be an integer scalar, and all rows (or columns) are shifted the same amount.

There is no Matrix pack() operation.

Rotating a vector is similar to shifting a vector, except that instead of filling one side with zeros, the values wrap around. All elements participate, regardless of the active set.

intVector A(N), B(N); ... A = rotate(B, 3); // rotate right: A[i+3 mod N] = B[i] A = rotate(B, -2); // rotate left: A[i-2+N mod N] = B[i]

Matrix rotations are similar to the vector rotations, in the same way that matrix shifts are similar to the vector shifts. Thus,

floatMatrix A(N), B(N); intVector S(N); // an amount to shift ... A = rotate(B, 3, 4); // rotate each element 3 down and 4 right A = rotate_rows(B, S); // rotate each row i by S[i] A = rotate_cols(B, S); // rotate each column i by S[i] 

Permuting a vector moves each active element of the input vector to a destination in the output vector determined by a second input (permutation) vector, whose elements contain the new position for each active element. A permutation is always an intVector.

floatVector A(N), B(N); intVector P(N); // a permutation, ie, each element is in 0..N-1 ... A = permute(B, P); // permute B by P

The result vector is initialized to be the same as the input vector, and then overwritten by copying the active elements from the input vector to the appropriate destination element in the result vector. Thus, active elements may move to inactive positions.

There is no Matrix permutation operation.

DAPPLE has some primitive I/O capabilities for vectors and matrices. In particular, vectors and matrices can be used with iostreams for input and output:

intVector A(N); intMatrix B(N,M); ... cin >> A; // it expects N whitespace-separated integers cin >> B; // it expects N*M whitespace-separated integers, row-major order ... cout << A; // prints N tab-separated integers (no newlines) cout << B; // prints N lines, each with M tab-separated integers

Only the active elements are written, so it may actually print fewer than N (or N*M) elements. If entire rows of a matrix are inactive, fewer than N lines may be printed. Similarly, for reading, only the active elements are read from input, and thus only the active elements should be provided as input.

Boolean vectors and matrices are printed as 0 and 1. Character vectors and matrices are printed without tab separators, and read without expecting them.

Vectors and matrices can also be loaded from an array, or stored into an array:

intVector A(N); intMatrix B(N,N); int array1[N]; int array2[N][N]; ... A.load(array1); B.load(array2); ... A.store(array1); B.store(array2);

Only the active elements are loaded or stored.

Here are some "gotchas", places where the DAPPLE semantics lead to surprising results:

   As another example consider the code for a parallel qsort (quick sort) algorithm shown in Figure 13. The results of running the qsort program are shown below.

// qsort -- a parallel quicksort program
//   written using the Data-Parallel Programming Library for Education (DAPPLE)
// This parallel quicksort recurses through smaller and smaller subsets
// of the list, as with any quicksort; we track the current subset by
// using DAPPLE's dynamic active-set facility.  That is, only those processors
// in the current subset are active.  The recursive quicksort routine
// can tell the size of the list with an n_active reduction, pick a splitter,
// and then compare all active elements against the splitter in parallel.
//    Each recursive call acts only on the elements that correspond to
// active VPs, and assigns values to the permutation vector only for those
// VP.  On return from quicksortr, the permutation vector holds the
// values of the eventual position in the sorted output list for each
// of the currently active VPs.
#include <stdlib.h>
#include <iostream.h>
#include "dapple.h"
const int N = 11;
// which VP am I? useful for VP-specific math
const intVector VP(N, Identity);
const int data[N] = {16, 15, 19, 12, 14, 13, 15, 11, 17, 15, 18};
void quicksort(intVector& X);
intVector quicksortr(const intVector& X, const int left);
int
main()
{
    intVector X(N, data);             // X the main data array, initialized here
    cout << "Initial:     " << X << endl;
    quicksort(X);
    cout << "Final:       " << X << endl;
    return(0);
}
// @SUBTITLE "quicksort - sort a vector"
// We call an auxiliary function quicksortr to determine the permutation
// needed to sort X, and then we actually sort X by permuting it.
// The permutation vector is simply a list of N values, one for each
// data value in X, that are the *position* in the sorted list where that
// data element should go.
void
quicksort(intVector& X)            // X is input and output
{
    intVector P(N);                      // a permutation vector
    P = quicksortr(X, 0);              // find the permutation needed to sort X
    cout << "Permutation: " << P << endl;
    X = permute(X, P);                // permute X according to P
}
// @SUBTITLE "quicksortr - the recursive part"
// X is not changed here.  We are responsible for filling in P;
// well, that part of P corresponding to the current active set.
//   X is the vector to sort.
//   "left" is the destination VP for the leftmost element of the sorted
// set of integers represented by the current active set of VPs.
//   returns a permutation vector giving the final destination of
// every element in the active set.
intVector                                    // a permutation vector
quicksortr(const intVector& X, const int left)
{
    int n = n_active(X);                // how big is this sublist?
    intVector P(N);                       // the result
    // check the number of active processors (ie, size of our sublist)
    if (n <= 0)
      ;                                             // do nothing
    else if (n == 1)
      P = left;                                 // the one element goes right THERE
    else if (n == 2) {
       // too small to recurse, so just check to see if we need to swap
       ifp (VP == min_index(X))
         P = left;                              // left one get smallest
       else
         P = left+1;                          // next one gets largest
    } else {                                     // n >= 3
       int splitter;                            // splitter value
       int splitVP;                            // VP# that holds the splitter value
       int middle, right;                    // starting position of subsets
       // pick a splitter; for simplicity I'll just use the first value
       splitter = first(X);
       splitVP = first_index(X);        // which VP holds the splitter?
       // find the left half, those less than or equal to splitter
       // except for that first one...
       ifp (X <= splitter && VP != splitVP) {
       // compute our destination in the result vector
        P = quicksortr(X, left);
        middle = left + n_active(X);  // rest will begin here
       }
       // move the splitter into the middle
       ifp (VP == splitVP) {
           P = middle;                        // route it there
           right = middle + 1;             // the rest will begin here
       }
       // do the right half, those greater than the splitter
       ifp (X > splitter) {
           // compute our destination in the result vector
           P = quicksortr(X, right);
       }
    }
    return P;
}

Figure 13: The qsort program (above) and results of running it as an example (below).

This program qsort can be used to demonstrate several features of DAPPLE as well as its weaknesses. qsort demonstrates the ability of DAPPLE to manipulate data within a vector and in particular its ability to dynamically narrow context to a subset of the virtual processors. The quicksort procedure recursively sorts the active portion of its vector argument. Initially quicksort is called with all processors active. It begins by using the reduction n_active() to find the size of the subvector it is to sort. Then, it dispenses with two special cases: subvectors of size 0 or 1 are trivially sorted, and a subvector of size 2 may require a swap. The reductions min_value(), max_value(), and first(), are used to compute the minimum and maximum values and assign them to the appropriate element. Otherwise partition and recurse. To partition, it chooses a splitter value (here, the value at the first active processor), builds a permutation subvector that specifies the destination of every element in the repartitioned subvector, and then permutes. It restricts the context of the left partition and recurses, and then restricts the context to the right partition and recurses. This example demonstrates one weakness of DAPPLE, its inability to support nested data parallelism. The two recursive calls to quicksort() must be done sequentially, each with only a small subset of the virtual processors active. Thus other sorting algorithms could also be more appropriate.

HPF and Virtual Processors

   HPF (high performance fortran) is a data parallel language based on Fortran 90 (HPF 1.0) and Fortran 95 (HPF 1.1). The most important feature of this language is that it defines a three level mapping of data, which represents in fact arrays. The first level is an alignment level. Arrays are aligned relative to one another or to a template, which is an abstract reference array. The language provides the ALIGN directive for this purpose. Then, arrays are distributed onto abstract processors using the DISTRIBUTE directive of HPF. This distribution is defined onto a rectilinear arrangement of virtual processors. The last level is declared to be an optional implementation-dependent directive. The PROCESSOR directive consists in mapping virtual processors onto physical processors. As opposed to this most other compilers map exactly one virtual processor onto one process. Processes are assumed to be one per physical processor, so they use the virtual onto physical processor mapping very poorly. Figure 14 gives the schema of the 3 levels of mapping and distribution of the HPF model.

Figure 14: The HPF Model in which arrays are aligned with a template which is distributed onto abstract or virtual processors, that are then mapped onto physical processors.

The ALIGN directive: This phase aims at minimizing communications by forcing elements of different arrays to be aligned at the same location as some elements of the template. Thus for any distribution all the elements of arrays alligned with an element of the template are guaranteed to be mapped onto the same processor.

The DISTRIBUTE directive: This phase deals with the spreading of data onto virtual processors. The basic distributions are the block (block) and cyclic (cyclic(i)) distributions which can be seen like special cases of the block-cyclic distribution (cyclic(n)). The distribution of a k-dimensional array representing a template onto a l-dimensional array of virtual processors, with l smaller than k, is specified dimension by dimension. If l is strictly smaller than k, the remaining dimensions are replicated onto all the virtual processors.

The PROCESSOR directive: This directive uses a user-declared Cartesian mesh of abstract (or virtual) processors. Abstract processors represent memory regions. The implementor can use the same number or a smaller number of physical processors to implement it. The language also provides an intrinsic function that returns the number of physical processors. It is useful to compute the number of virtual processors according to the number of physical processors. A typical example distribution is shown in Figure 15.

Figure 15: The array A(4,6) is mapped on the array of vritual processor P(2,3) by the directive of distribution DISTRIBUTE A(BLOCK,CYCLIC) ONTO P. The first dimension of A is distributed block-wise whereas the second is cyclicly distributed. The different shades of the array elements represent teh abstract processors on which they are mapped. The alignment phase to a template is often by-passed when there is only one array to distribute.

With regular computations and data distributions HPF is very efficient. However the problem arises from irregular distributions where load balancing in an issue that burdens the programmer. In order to achieve good load balancing the compile-time knowledge about the program with dynamic load balancing depending on run time considerations must both be considered. In any event the resultant outcome in case of irregular programs must be such that maximum performance is acheived by getting as close to a regular computation as is possible. Regularity allows efficient access to the memory which is now linearly described. As a consequence the cache hit and thus the bandwidth of the memory is very high because the usual data cache management policies are optimized for linear memory runs. The data distributions also have simple representations. Functions that map data onto virtual processors and their inverses are simple functions. The computation of communication schedules in regular codes is also easy. Irregularity can creep in from an unbalanced distribution of computations or from a repeated irregular communication pattern. Unless an heterogenous computing environment is being used, irregularity generally leads to slower execution and improper utilization of the resources. On heterogenous computers creating an irregular distribution can balance the load and/or reduce the overall communication cost in terms of number of messages and/or in terms of communication volume according to the architecture of the underlying network of communications.

In terms of irregular distributions the second level function cyclic(n) is not well suited. Assume that these is a directive that specifies for each element of the array the virtual processor to which it must be assigned and that this assignment can change. Thus, we have the most general definition for a distribution directive that can support all distributions. This results in a complete loss of regularity. Further for codes that have a part of regularity it leads to high overheads. Thus more regular - but still irregular - distributions have been proposed.  Real codes like unstructured mesh codes or molecular dynamics codes need irregular distributions to do load balancing. The High Performance Fortran Forum includes general block distributions and irregular distributions as approved extensions in the definition of HPF 2.0. The general block distribution and irregular distributions are restricted to arrays of one dimension. They can use a variable to define the mapping when applied with the REDISTRIBUTE directive.

The third level controls the mapping of abstract processors onto physical processors. Irregularity can be obtained by using an irregular distribution of virtual processors. If there are more virtual processors than physical processors the load can be balanced by distributing virtual processors in such a way that the load of each physical processor is as close as possible to the average load. Figure 16 shows an irregular distribution of virtual processors onto 3 physical processors.

Figure 16: Example of irregular virtual processor distribution. Let processor A be 3 times as fast as processor C and processor B 2 times as fast as processor C. The array A(4,6) is distributed onto the virtual processor mesh P(2,3) using the directive DISTRIBUTE A(BLOC,CYCLIC) onto P. Then, the load balancing system can map these 6 virtual processors onto the 3 physical processors in a balanced way because it has some knowledge of the execution environment.

Finally comes the problem of implementing the virtual processors. There are three approaches to take all as migrating threads. The usual simple way is to allocate each virtual processor to an independent process, which is what HPF compilers assume. This is shown in Figure 17.

Figure 17: One vritual processor per process. Array A is distributed onto a 2 x 3 grid of 6 virtual processors. Each virtual processor is embedded into a process. The physical machine in this case has only 3 processors, so each physical processor receives 2 processes.

This solution is easy to implement. Current HPF compilers can generate more processes than physical processors, so it is possible to have several processes per physical processors, leading to the allocation of several virtual processors per physical processor. The migration can be achieved by numerous systems that support process migration like Condor, Cocheck, MPVM, PBS etc. The drawback is that the context switch of processes is expensive. The communication between processes on the same processor are very slow as compared to memory accesses. The replication of code also causes a waste of memory on each physical processor. Finally there are the migration overheads even with NFS.

The other approach is to map several virtual processors into a process (and then onto a processor). The process code essentially emulates the virtual processors by virtual loops. For a sequential machine it consists of embedding each data parallel instruction in a loop that enumerate all virtual processors, a technique often used in compiling explicit data parallel languages like C*. Figure 18 shows an example of this.

Figure 18: Virtual processor emulated by virtualization loops. Array A is distributed onto a 2 x 3 grid of 6 virtual processors. Virtual processors are emulated by virtualization loops in which the physical machine has only 3 processors, so that each physical processor supports 2 virtual processors emulated by an unique process.

The advantages of this method include the fact that direct memory communications between the virtual processors are availble. With a small memory overhead a process can manage many virtual processors. Migration is also minimal because only date and some management variables need to be sent. However due to dependencies in HPF the need for scheduling has to be built in, with the scheduling satisfying the dependencies in the virtualization loops. This is not trivial for a programmer. Futher although the overheads associated with the context switching are very low there are just too many of it.

The final method is to transform the parallelism of processes to a parallelism of threads. However the external interference of the OS has to be neglected, or assumed to be small. The solution consists in embedding a virtual processor into a thread. This is shown in Figure 19.

Figure 19: Virtual processor embedded as a thread. Array A is distributed onto a 2 x 3 grid of 6 virtual processors. Each virtual processor is embedded into a thread. The physical machine has only 3 processors, so each physical processor receives 2 threads that are embedded in a process.

The advantages are the fact that this is very simple to implement. The thread only needs a private stack, and communication is via direct memory transfers between threads in the same process. Thus efficient virtual to virtual processor communication is possible. Thread migration cost is also very low since the migration message contains the data, the used stack and thread management variables, with the data the main part of the message. However the big drawback in a multi-threaded environment is thread safety and thread synchronization, and thread context switch. This problem with integration of communication, and efficiency in thread creation is a major concern as the choice between thread creation and thread synchronization is not obvious.

Thus we see that on modern day computers (as is also seen in the case of the Intel Xeon which is considered in detail in the next application note) where on chip cache is large and speed and bandwidth are high, full processor usage can be obtained using virtual processors. The eventual speed up however will taper off and after a certain stage performance degradation is bound to happen. However as all the programs that will be written use PUB, DAPPLE, or HPF (these are easily scalable) the solution to increase performance is simply to increase (add) the number of processors available for the particular task. Thus resource usage is no longer a domain controlled by a OS and one can almost eliminate idle CPU time in running programs using virtual processors as opposed to a serial one where the major CPU idle time occurs duing memory accesses.

 

BACK TO TOP

BACK TO MICROWAY APPLICATION NOTE INDEX

(Author: Nilay K. Roy.)