Real Time Resource Management  and Adaptive Parallel Programming for a Cluster of Computers: A Comparison of Different Approaches in a Computationally Intensive Environment.

N. K. Roy. (Ph.D)

Senior Scientist, Microway, Inc., 12 Richards Road, Plymouth, MA – 02360.

 Published in: Proceedings of IEEE Computer Society: IEEE International Symposium on Network Computing and Applications, October 2001, p216-226.

ABSTRACT

   In this paper two classic problems that are computationally intensive and show good speed-up and scalability when solved in a parallel programming environment are used to test the different resource allocation and management algorithms used with the node intrusion and failure experiment. We divide the adaptive resource allocation experiments into two groups: (i) Automatic Survivability and Scalability (ii) Assessment of Real-Time Quality of Service (QoS). In the former we use different algorithms to detect failed programs, host and network resources and idle times, computing an allocation, enactment of an allocation, and restart notification. Here we also use different techniques to detect dynamic paths that are receiving poor QoS possibly due to overload and to “scale up” such paths via reallocation. In the latter case we use different fitness functions to classify the connections and the resources available on the nodes and study the effects of these on the overall resource allocation and the eventual speedup.  

1: INTRODUCTION

   Parallel programming environments such as PVM, MPI, and Linda provide different programming paradigms (message passing, virtual shared memory), different programming styles (master/slave, peer-to-peer, bag-of-tasks, pipelining), and allow different parallel programming approaches to be used (functional parallelism, task parallelism, domain decomposition) in order to solve a computationally intensive problem faster than normally possible. However they do not provide advanced programming tools such as load balancing, application and system reconfiguration, and fault tolerance and recovery mechanisms. While several batch schedulers (in particular PBS) have some of these features like time-sharing of hosts, load balancing and rescheduling of jobs when nodes are down or up, they lack much flexibility in terms handling problems in real time (on the fly). The node intrusion and failure experiments from an inherent part of our adaptive resource allocation experiments. This intrusion is varied so as to increase latency times to becoming an obstruction to the main task. From this we are able to compare the effects of the different resource allocation algorithms we use to solving these two classic problems. Real time resource management forms a very important part in cluster computing. As the wide acceptance of cluster computing on a network of computers is gaining momentum, we see larger and larger clusters, with the probability of failures increasing with the addition of every single node. Further as these computers are networked using some type of LAN the occurrences of communication bottlenecks is certain due to the limited bandwidth available. In many cases the cluster may be heterogeneous, or there may be many users competing for limited resources. While batch schedulers like PBS [1] or Condor [2] can handle the job scheduling and resource allocation based on user policies, the problem arises when there are node failures, or unexpected bottlenecks or latencies. Since the message passing in most clusters is handled by some message-passing paradigm like PVM [3] or the now more widely used MPI [4] it is imperative from a practical point of view that these have some built in mechanism for enforcing data coherency, check pointing, and failure and node load detection. However this is not the case and these programs are largely intolerant to failures. Thus the user has to build in their own mechanism typical of the application and the cluster configuration to evaluate the status of the nodes, enforce data coherency and check pointing, and to kill and reroute jobs or scale up the jobs as resources become available. As we will show even a simple implementation can greatly result in the efficient use of resources, reducing down time of the job, and overcoming the necessity of resubmitting the job for execution. Also in most cases like in the MPI paradigm, the classic “mpirun” used to start a job looks at some form of a machines list and schedules the jobs automatically based on a number of parameters passed to it including the number of nodes or number of instances of the program to run. Accordingly depending on the node(s) specified or the number of instances (in which case the machines list is parsed in some order) the jobs are started. No check is made of the status of the node(s), their load structure and the status of the communication interface. It is assumed that the nodes are equivalent.

   The effects of selecting the nodes on which to run the jobs using a fitness function to determine the status of the quality of service (QoS) of each node before a job is submitted is studied. This forms a static test and as will be shown results in a vast improvement of the resource usage.  Next while the program is running on each node within the message-passing paradigm, checkpoints are introduced where the current state (data and point of execution) is saved for each node and the nodes and the network are evaluated again. If a predefined set of criteria is not met by the node(s) then the program(s) and/or data are migrated to appropriate node(s) and execution is started at the correct point. We study the effects of different algorithms that do this on two computationally intensive problems: LU decomposition [5] and 3D multi-grid potential field solver [6].

2: METHODS AND COMPUTING ENVIRONMENT

   The cluster on which the experiments are conducted is a four-node dual processor i686 SMP cluster in master-slave style. The operating system is Red Hat Linux 7.0 running Linux 2.2.16-22smp kernel. The back plane is the standard 100Mbps fast Ethernet. The MPI paradigm used is LAM-MPI 6.3.2. The programs are invoked from the master using XMPI, so that detailed tracing is possible. The –C syntax supported by LAM-MPI enables the user to specify the processor on each node on which to run the program. Further under XMPI using the “Browse&Run” item in the “Application Menu” a simple file browser pops up enabling the selection of a pre-written application schema that can be run. This is discussed in more detail below. The trace files from a given run can be saved and then be analyzed in detail. A typical trace file is shown below (for the LU problem) in Figure 1. Each node in the cluster runs a script that cats “uptime” and “free” and runs netperf [7] in a 10s TCP stream test to all the nodes into a node status file. The main MPI program then periodically parses this file and the QoS can now be determined from the information in it. Figure 2 below gives this program on the master. The corresponding programs running on the slave

           

Figure 1: A typical XMPI trace file.

nodes are similar. The status file is updated every 100s. For these experiments we assume that disk space is sufficient on every computer in the cluster and there are no disk I/O bottlenecks or conflicts. If disk usage is critical the linux du command can be used to report the disk usage in the above program and an additional term and weight factor can be incorporated into the QoS equations to correctly account for this. However with the implementation of software RAID in linux and the availability of extremely large disks the main issues in cluster computer computing still remain the communication bottlenecks and associated latencies.

Figure 2: Status file script run on every computer in the cluster.

2.1: Assessment of Real-Time QoS

  Obtaining the real-time QoS for the nodes that correctly characterizes the state of the node is the most important factor in maximizing the effects of the dynamic resource allocation and management scheme. To determine the QoS for each computer in the cluster the status file running on the respective computer is used. The main MPI program parses it. The information collected consists of the free available memory, the load average, the free swap space, and the network speed to each computer in the network from the computer in question. Since the status file is updated every 100s. If the main program (always run on the master or node0) cannot access the status file on any node the QoS is set to zero for that node. The case may also arise when the status file is not written fully, as the netperf tests to all the nodes are not completed. In that case there is a wait time with a time out of 10s before the status file is read again. Note that it is rewritten after a time out of 20s. The QoS for each node i ( 0  i  (N – 1) ), QoSi is defined as indicated below in equations 1-4. Note for the master i=0 (node0 or master), and the nodes are numbered from node1, node2, …, node(N-1) although for convenience we number the nodes on the test cluster master, node2, node3, and, node4. w is a weight factor and varies from 0 to 1. Clearly the best cluster will have QoS = 1, and the best node QoSi = 1. The weight factor can be set depending on the type of

________________________________________________________________________

QoSi =                                                                                        …(1)

                          …(2)

NETFITNESSi =                                                                               …(3)

The overall QoS for the cluster is defined as:

  QoS = , where N is the total number of computers in the cluster.                                          …(4)

________________________________________________________________________

problem at hand. For example if there is equal time spent in communication as in computation then it is a good idea to set w=0.5. However if the job is embarrassingly parallel then w can be set to 0.95 or even higher. The QoS forms the basis of the three resource allocation algorithms we use. These are (i) Best Host Algorithm (ii) Round Robin Algorithm and (iii) Random Threshold Algorithm and they are discussed in more detail below. These algorithms form part of the directed acylic graph algorithms (DAG algorithms). The DAG algorithm is known to be NP complete except for three simple cases and these are the ones the above three algorithms we use exploit.

2.2: 3 D Multi-Drid Potential Field Solver (mg)

  This code requires a power-of-two number of processors. It has four critical subroutines [8]. These are the smoother, psinv, the residual calculation, resid, the residual projection, rprj3, and the trilinear interpolation of the correction, interp. The partitioning of the grid into processors occurs such that the grid is successively halved, starting with the z dimension, then the y dimension, and then the x dimension, and repeating until all power-of-two processors are assigned. This problem for running on four processors starts with a grid of size 256 x 256 x 256. The partitioning algorithm will create four partitions each of size 256 x 128 x 128. Each partition will contain 2 coarsest multi-grid level points each surrounded by a cube of grid points of size 128 x 128 x 128 indicated by a top level of 8. There are twenty iterations set for this problem. In order to enforce checkpoints and data coherency in case there is a necessity for a rollback or a restart of execution of the program on another node(s), at the end of every iteration, the state and date of each grid on the node where the job is being run is stored on the master. This however does create an overhead in terms of sending the data back to the master for storage. In this case the job on the master is started only after no other nodes are found suitable for running the application. Thus in the case of our cluster at any time during the execution four out of the eight processors are idle. Hence our node intrusion experiments are concentrated only on those nodes that run the job. We choose to select node2 and node3 on our cluster for running this problem. This is a computationally far more intensive than the communication it required for a given problem size. Hence we try w = 0.5, 0.7, 0.9 for this case.

2.3: LU Decomposition (lu)

   This code also needs a power-of-two number of processors. A 2-D partitioning of the grid onto processors occurs by halving the grid repeatedly in the first two dimensions, alternatively x and then y, until all power-or-two processors are assigned, resulting in vertical pencil-like grid partitions on the individual processors [9]. The ordering of point based operations constituting SSQR procedure proceeds on diagonals which progressively sweep from one corner on a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition. This constitutes a diagonal pipelining method and is called a “wavefront” method. It results in a relatively large number of small communications of 5 words each. This is as communication intensive as it is computationally intensive. Hence we try w = 0.2, 0.4, 0.6 for this case. It is very sensitive to small-message communication performance of an MPI implementation and sends large numbers of very small (40 byte) messages. The matrix size is set at 102 x 102 x 102 and a checkpoint is inserted on the completion of every z plane. This results in a frequent checkpoint as compared to the earlier problem. This problem is also run on node2 and node3 of the cluster, which is where all intrusion experiments are performed. The total number of iterations (time steps) is set at 250. As compared to mg the size of the data that is stored for every node for data coherency in case of a rollback and/or rescheduling/restart on the master is much smaller.

2.4:Algorithm for the Application Schema

   The application schema for the runs is invoked on the master using XMPI. The application consists of the following stages and a block diagram is shown in Figure 3 while the detailed algorithm is listed in Figure 4.

a)Initialization phase: In this part the relevant input files and the executables are staged to the nodes where the programs may be run.  The QoS for all the nodes are evaluated and are stored on the master. Based on the resource allocation algorithm to be used the nodes are selected on which the execution is started.

b)Dynamic Host and Network Resource Discovery Phase: This phase starts just after a checkpoint where the QoS for the nodes are evaluated and the resource allocation algorithm is invoked to check if the jobs may proceed as is or scaling up is needed.

c)Detection of a Failed Program Phase: In many cases the QoS may be above average but for some other reason due to a programming bug or temporary unavailability of a dynamic library a program may fail to continue execution. In such a case it need to be killed, and if it cannot be restarted it needs to be migrated to another node. Thus the tests in these phases form a very important part. The strategy we use here is to run a set of LAM-MPI tests from the LAM_MPI test suite for 10s to check the integrity of the system.

d)Enactment of Allocation Phase: This phase occurs immediately after the Resource Discovery and Failed Program Detection Phase where the jobs continue as is or an interrupt is signaled and the restart/rescheduling and/or scaling up phase is invoked.


                       Figure 3: Flowchart showing the different phases in the application schema.

e)Restart/Rescheduling and/or Scaling Up Phase: In many cases it can be detected that one or more nodes did not execute the program correctly in the Failed Program Phase. In this case there needs to be a rollback so that all the nodes are synchronized and the program execution on all the nodes can be started correctly. In such an event it may be necessary to restart the application on the given node(s) and if this is not possible it may need to be rescheduled and then started on some other node. While in this case the problem has been restricted to a power-of-two number of processors it may be the case than new nodes become available. In such cases there may be a need to reschedule and restart the jobs on more nodes, to ensure better resource allocation. This is the last phase before the next checkpoint is reached.

2.5:Resource Allocation Algorithms

   In order to determine the nodes resources for enactment of allocation three different types of algorithms are used.

a)Best Host Algorithm (BHA): This is a simple algorithm where the QoS list for each node on the master is parsed and then sorted according to the descending order of the QoS number for each node. The jobs are then scheduled one by one starting from the top. Thus for a given number of jobs the best resources are used. The draw back in this method is the fact that it relies heavily on the way the fitness function is defined and on how correctly the fitness function can represent the real situation. Further in a high user environment where there are large fluctuations in the QoS values a good QoS value may not necessarily mean a good node, that may most of the time have a poor QoS value.

b)Round Robin Algorithm (RRA): In this case a threshold value is set for the QoS value for each node before it is included in the round robin



                                                 Figure 4: Algorithm for the application schema.

pool. Thereafter jobs are allocated in a sequential manner with the starting point constantly being shifted in a round robin fashion. While this method may in theory allocate the best resources in the cluster for the program, it does not rely entirely on a fitness function but uses it only as a guide. This algorithm is useful in large user environments and heterogeneous clusters that have different processor speeds and memory. The threshold value is set at 0.7 for this case.

c)Random Threshold Algorithm (RTA): In this algorithm while the threshold sets the nodes that enter into the selection list the final selection is totally random. A threshold of 0.5 is used.

2.6: Node Intrusion

   The node intrusion is done while the program is running only on those nodes on which the job is scheduled so that the QoS on open nodes (that are not running the program) remain high. These intrusions include running netperf in the background between nodes that are running the program and during initialization. The size and length of the packets that are sent and received can be adjusted so that the effect may be either a small communication bottleneck to almost no bandwidth availability for the programs.

  To load any node processor and memory, disk copy of large files and remaking of the kernel is started on the nodes. The number of such processes eventually determines the value of QoS. Figure 5 gives a plot of QoS v/s number of netperf, disk copy and kernel compilations running simultaneously for the case of the master. Note all other nodes are running normally. This plot now serves as a gauge to quantify the effects of the node intrusion being undertaken on any node. For this case w = 0.5.

3: RESULTS AND DISCUSSIONS

   For the two problems and the three algorithms considered we concentrate primarily on the execution time. While XMPI is a powerful tracing environment that offers exact comparison of the skew and patterns in the messages produced and the lags and latencies due to node intrusions, the primary concern in this case is focused more on execution rather than identifying problems due to increased latencies, race conditions and debugging. This is considered in detail in a later paper. The execution times on each processor are easily obtained from the XMPI trace window by moving the slider around and reading the time in seconds (Figure 1). For more details on XMPI

                     

Figure 5: Plot of QoS v/s node intrusion on master with w = 0.5. The number in parenthesis [i,j,k] indicate that i instances of netperf, j instances of disk copy, and k instances of kernel recompilations are running on the given node simultaneously when the QoS is computed.

        

Figure 6: Plot of executions times v/s node intrusion for the mg (left) and lu (right) problem. The number in parenthesis [i,j]  indicate that node intrusion of code i is started on node2, and of code j on node3. The type of node intrusion for the given node intrusion code is given in Figure 5. Refer to the text for an example for this interpretation.

                      

             Figure 7: Plot of execution time v/s node intrusion for different w values for the mg problem.

            

           Figure 8: Plot of execution time v/s node intrusion for different w values for the lu problem.

refer to [10]. Figure 6 gives the plot of the execution time v/s the node intrusion for the two problems (mg and lu) without using any algorithm (static runs). Figure 7 gives the plot of the execution time for the mg problem for the three algorithms for the three different values of w. Figure 8 gives the plot of the execution time for the lu problem for the three different values of w. In all our experiments we start the jobs on node2 and node3. After the fifth iteration in the case of the mg problem and the sixty-second one in case of the lu problem (25% of the run completing in both cases) the node intrusions on node2 and node3 are started and remain active until completion. The x – axis notation for Figures 6 – 8 are interpreted as follows: [i,j] implies that node intrusion with code i was started on node2 and node intrusion with code j was started on node 3. The node intusion codes are given in Figure 5. For example if we have [3,6] this means that after 25% of the run the node intrusion on node2 was [3,3,3] which is 3 instances of netperf, 3 of disk copy, and 3 of kernel recompilations while the node intrusion on node3 was [10,10,10] which is 10 instances of netperf, 10 of disk copy and 10 of kernel recompilations, all running in the background. From Figure 6 we see that there are clearly two regions one in which the execution changes only slightly and the other where there is an almost exponential increase in the execution time. This is the main region of interest. Thus for the dynamic runs we select node intrusions 5, 7, 9, and 11 to concentrate on, and study the effects of the three algorithms with varying w values on the execution times. Figure 7 shows that smallest execution times for the mg problem is obtained only for the BHA and RRA algorithms for the case w = 0.9, and from Figure 8 this is true in the lu problem only for BHA and w = 0.4. However there was considerable decrease in execution time for all cases as compared to the static case especially for node intrusion case 9, and 11.. From this while the resource allocation algorithms did in fact improve the execution times for the mg problem it is clear that the method is sensitive to QoS and the type of algorithm used. Further on SMP systems as in this case as jobs are restarted/rescheduled on the QoS values on a node basis rather than a processor basis, there may be an improvement for the case of the other algorithms that do not detect and restart/reschedule a job earlier. The weight factor in the mg case that worked best was 0.9. As the mg problem was as computation intensive as communication intensive and the fact that there were fewer checkpoints, the need for more emphasis on the state of the local memory and kernel usage is evident from the value of w = 0.9 that showed best results. In the lu problem that was communication intensive and had far more checkpoints that the mg one, greater emphasis on the fitness function should be on network bottlenecks. This is evident from the value of w that worked best (0.4). In the mg problem the RRA algorithm was found to be more suiltable, unlike in the lu case where the BHA algorithm worked better (Figure 8).  For the lu case that was communication intensive the RTA appeared to be unreliable showing a large fluctuation for the node intrusion case 11 and w = 0.2 (Figure 8). Having a higher threshold than the current one that was set to 0.5 for this algorithm may make the algorithm more reliable. Another technique may be to simultaneously use the BHA and RTA algorithm, where several hosts based on the BHA algorithm are selected and then the RTA algorithm is used.

CONCLUSIONS

   Two computationally intensive problems are taken and three different algorithms are used to detect dynamic paths for optimal resource usage. The node intrusion is widely varied and the effects are then studied. From these studies we see that the algorithms themselves need to be selected carefully depending on the type of problem. The parameters like the weights that are set in the fitness function also need to be selected. This necessitates having a good knowledge of the type of program. The program should typically be analyzed to determine the amount of computation and communication. Then the number of checkpoints needed in order to ensure sufficient security without adding to much overheads must also be determined. Finally prior knowledge of the number of users, heterogeneity in the processor speed and the memory, and other programs that will be run will help to classify the degree of intrusion that can be expected on a given node or in the cluster as a whole.  From this a fairly accurate and optimal choice of the algorithm to be used and the parameters to set within the algorithm can be determined. In any case as is shown in this work there is bound to be an improvement of at least an order of magnitude in the execution times if such measures are taken when running computationally intensive tasks in a distributed memory cluster computing environment. Even if an optimal solution is not obtained the speed up obtained even in the worst case justifies the need for standardizing and incorporating such techniques in any large-scale parallel computation task undertaken on a cluster of computers. 

REFERENCES

1http://www.OpenPBS.org

2Jim Basney, Miron Livny, and Todd Tannenbaum, "High Throughput Computing with Condor", HPCU news, Volume 1(2), June 1997.

3Beguelin, Dongarra, Geist, Manchek, and Sunderam, “A Users Guide to PVM (Parallel Virtual Machine), ORNL / TM – 11826, July 1991.

4Greg Burns, Raja Daoud, and James Vaigl, “LAM: An open cluster environment for MPI”, Proceedings of Supercomputing Symposium ‘94, p 379, University of Toronto, 1994.

5Kincaid, and Oppe, “A parallel algorithm for the general LU factorization”, Communications Applied Numerical Methods, 4, p 349, 1988.

6Bailey et al, “The NAS Parallel Benchmarks”, International Journal of Supercomputer Applications, Vol 5, No. 3, p 63, 1991.

7Hewlett-Packard Information Networks Division, “Netperf: A Network Performance Benchmark, Version 2.0, February 15, 1995.

8Bailey and Saini, “The NAS Parallel Benchmarks Results 12-95”, NASA Technical Report NAS-95-021, NASA Ames Research Center, Moffett Field, CA, 94035-1000, December 1995.

9Dongarra, and Walker, “Software libraries for linear algebra computations on high performance computers”, SIAMREV, 37, p 151, 1995.

10Microway_Working_Note_1.html