![]()
A Tutorial on NDP Alpha Compiler Driver and OptimizationsStephen S. Fried, Microway, Inc.Please Note: This was written in 1997.
1.0 Introduction: Alpha family history The first microprocessor capable of executing a billion or more floating point instructions per second became available in late 1996. The DEC Alpha 21164 relies on low level parallelism to achieve this level of throughput. However, for programs to take advantage of the chips numeric power, they have to be compiled with tools which can parallelize codes at the floating point unit level. This was not a requirement for the CISC processors that most of us have experience writing code for. This paper describes how to use NDP Fortran and C|C++ to create programs that run at the full throughput of the 21164. To make effective use of these tools users will not only have to learn about the Alpha architecture and the NDP compiler family, but how to re-write some CISC codes so that they do not stall the pipelined operation of the processor. The Alpha's lineage begins with 1980 vintage machines built by Cray and the Intel i860 - a Cray on a chip that is still the power behind some of the largest massively parallel Supercomputers on the planet. The i860 and Cray architectures feature "vector registers" - a very fast SRAM memory that is used to hold vectors. The items in a vector register can be accessed at the full speed of the Cray or i860's multipliers and adders. Because these units are pipelined, they can start both an add and multiply once per cycle. The adders and multipliers in the Alpha also rely on this concept. This stands in sharp contrast to the scalar units of the Intel® Pentium® and other CISC devices, which take two cycles for each operation and must complete each before starting the next. An Alpha can effectively perform both an add and multiply per cycle where the Intel® Pentium® executes a single operation every other cycle. The difference is a factor of four in per cycle performance. Once an array or array fragment gets loaded into the vector register of a Cray, it could be accessed without burdening main memory. To help feed main memory into on chip registers, Supercomputers use data busses that are 1,000 or more bits wide. This is one of the factors that makes them so expensive to build. The Alpha memory bus found on most motherboards is 288 bits wide, while not as wide as the busses used on Supercomputers it is four times wider than the busses found on typical Intel® Pentium® motherboards. For some problems, the speed of the memory bus is the rate limiting factor while for others it is the speed of the vector registers or memory and vector store combined. The programming strategy that worked best for large data sets on a Cray or i860s was divide and conquer - place one data set in the vector register and read the other from memory. In the case of a matrix multiply, the rows of the first array were loaded into the vector register one at a time while the columns of the second array were read in and used to form dot products with the current row in the vector register. From a data flow perspective, the rows would occasionally be loaded into vector registers, while the columns flowed through. This strategy reduces the memory bandwidth requirements by a factor of 2. The ability to flow data through a processor without washing out the contents of the highest speed on chip memory (vector register) is a characteristic of vector machines. The i860 has this ability, the Alpha does not. The strategy used by the Alpha is similar, but replaces the vector register with the 31 available floating point registers. The Alpha strategy to achieve peak performance relies on the speed of its floating point registers and general purpose caches, which do not have a protected region that can be used to emulate a vector register. Machines with general purpose data caches can be used for vector problems, when the problems are recast in block form. Frequently this is done by examining the different ways in which operands interact. For example, in the case of a matrix multiply, if we take three elements (none of which are near the edge of the array) from each array and ask the question how many ways can these six elements interact with each other, we will discover that in the process of doing a matrix multiply a dot product will get formed between each member of the first array and all the three elements of the second. For the three elements of the first array we will form 9 dot products with the elements of the second. From the perspective of the CPU, we read 6 floating point items and do 18 numeric operations, each numeric operation is loaded by an I/O equal to the size of 1/3 of a floating point number. This is quite better than the 1 to 1 I/O loading we get if we simply load a pair of operands and form a dot product (which does 2 operations). However, things can be made better still. If we leave three of the six operands used by our blocked routine in registers and then permute these against all the possible groups of three in the second array our I/O loading will go down another factor of 2 more to one I/O per six operations. When a matrix multiplication is re-written in "blocked" form, the normal three deep nested loop gets transformed into a six deep nested loop. Part of the complication associated with this transformation is the handling of edge effects - the three by three transform discussed above does not work when the elements are within two rows of the edge of the array. For this region different permutations are used. Block forms are created automatically by translators like KAP which takes a Fortran program and produces a new one which is blocked. One of the problems that we experience as compiler writers is convincing people that there has been a basic change in how machines function. Tools like KAP were originally developed to improve the performance of Supercomputers. These machines started with CPU's capable of 40 megaflops and progressed to CPU's that achieved hundreds of megaflops. The people who wrote code designed to run on Supercomputers lived in Ivory towers. They were the descendants of the guys in white coats who ran the mainframes in the days of batch processing. The techniques they developed to drive the vector machines created by Seymour Cray are mostly unknown to those of us who did not have access to a Supercomputer or grew up on an 8088. Those of us who had the good fortune to teach ourselves how to code an i860 several years ago are familiar enough with vector machines to be able to make the jump from the static world of scalar machines to the dynamic one of vector machines. The i860 taught us that there is more to code than getting the job done. The way you got the job done was often much more important than just getting it done, if you were looking for performance. We learned that the i860 ran in two modes, scalar and vector, one of which ran a factor of six faster. We learned how to use the VAST vectorizer, the same tool used on Crays. We also re-discovered Amdahl and his law, which worries about what you have to do after you speed up the vector portion of your code. In general terms what we learned is that the machine had a sweet spot and anytime we could get our programs to fit into it, they flew There were two choices on an i860, using scalar floating point instructions (add and multiply) or the 32 pipelined floating point instructions that made it possible to overlap the operation of the six stages of the i860's numeric unit. The pipelined instructions made it possible to not only control what got done, but the motion of data through the machine. These instructions made it possible for the adder and multiplier to work in parallel, processing up to six operands at a time. In pipelined mode all six stages of the floating point unit are kept full, while in scalar mode only a single stage is occupied at any time. The speed of the machine turns out to be proportional to the number of stages that stay occupied. If you didn't have the time to write pipelined i860 code you could use the i860 version of VAST. The programs produced by VAST call libraries composed of hundreds of vector primitive routines, each of which had been micro coded for the Cray or i860. Programs which had been vectorized on a Cray could be recompiled with the VAST libraries for the i860. The Alpha is very much like an i860. Where the i860 jumps a factor of six when moving from scalar to vector mode, the 21164 jumps a factor of 8 when moving from scalar to pipelined codes! Unlike the i860, however, it is possible to produce very good Alpha code without resorting to writing vector primitives by hand or calling vector libraries. The Alpha does not have a scalar and pipelined mode - all of its instructions run pipelined when they can. The goal of the RISC tools we describe in this paper is to take ordinary codes and convert them into Alpha assembly language which keeps the Alpha's pipelined stages full. The tools which we describe can produce elegant code when they are applied to vector primitives. This eliminates having to code primitives by hand, but it doesn't eliminate having to learn enough about the system you are writing code for to be able to not learn how to help the tools by making intelligent guesses about how the tools are applied. Not all programs can be vectorized - there are many programs in which each instruction must finish before the next starts. We call these programs scalar bound. The optimization of these codes is much less involved than the creation of pipelined vector codes. In general, where vector codes rely heavily on the form of loops to improve speed, scalar codes rely on the reduction of the number of operations. Both forms of code rely on scheduling - the process of re-arranging instructions so that the number of stalls experienced by the system are minimized. However, vector codes are much more sensitive to scheduling than scalar codes. The optimization methods used on scalar bound programs is the same on the Alpha as it is on typical CISC processors. The only programs which require special treatment are those that can be pipelined or vectorized. The optimization of vector codes is an esoteric process which requires the user to get a feeling for the relationships between compiler heuristics (the algorithms we apply to the code), the processor and the system it runs in. In this paper we discuss heuristics which do things like add cycles to memory fetches. The cycles we are adding are in the state machines used by the compiler to compute how quickly each group of four instructions we generate will execute. For example, we can tell the 21164 emulator built into the compiler to find the best code sequence assuming that a memory fetch will take 10 cycles instead of 2. We can then run the code produced and we often discover that adding a memory delay will improve the speed of codes in which the data being fetched does not come from the CPU's Dcache. We also discover that the fastest codes fetch data out of the Dcache and that these codes should be compiled assuming a latency of 2 cycles for fetches. Scalar programs do not benefit as much from fine grained assumptions like how long it takes the CPU to fetch data from the Scache, because these programs are often rate limited by other factors. The bottom line is you don't have to worry about much of what we are about to discuss if you don't intend to use your Alpha to process data stored in arrays in a Cray like manner or if you are happy achieving 10 to 40% of the 1,000 megaflops that a 500 MHz 21164 is capable of. 1.1 Improving Scalar Bound Codes The speed of many applications can be improved if the inner loops are rewritten to reduce or eliminate scalar bound expressions. Many of the old mainframe codes have been extensively optimized at the source level to improve execution speed. Often the numeric tricks used to improve speed reduce the total number of operations performed on scalar machines, but simultaneously build in scalar constraints which behave like speed bumps when they are executed on a pipelined device. A good example is the evaluation of polynomials. One of the main goals of the numeric transforms used to improve mainframe execution speed was the reduction of the number of multiplies, which took much longer on old machines than adds. For example, the most cost effective way to evaluate a fifth order polynomial on a CISC machine is to evaluate the expression: P = x* (x* (x* (x* (k5*x + k4) +k3) +k2) +k1) +k0 Summing up the number of adds and multiplies we conclude that there are 10. If we look at the equivalent fifth order polynomial written out in its normal form, we would discover that there are a five adds and nine multiplies - four to build the power table and five to multiply powers by coefficients. The expression above saves having to do four of these multiplies. If we evaluate how many cycles it takes to evaluate the expression, we discover that 40 cycles are consumed on a 21164 - 4 cycles for each of the 10 operations. The computation is scalar bound because there is no way we can overlap operations - we must first wait for k5 to be multiplied by x before we can add it to k4 and for their sum to be computed before we can multiply it by x, etc. The evaluation is an example of a 100% scalar bound computation - every operation must wait on its predecessor before it can start, and since each takes four cycles, the whole computation takes 40 cycles to carry out its 10 operations. Under normal circumstances the job of a scheduler is to find ways of doing things in parallel. For this problem there is nothing that can be done. You might ask, why try to do a better job. The answer is that depending on the number of polynomials that you can simultaneously evaluate, it ought to be possible to speed this calculation up by a factor of 2 to 4 or more. The trick lies in realizing that many of the multiplies and adds can be overlapped if we revert to the normal "summed" form in which powers are explicitly evaluated. Since it takes four cycles to complete an add or multiply, we will break the problem into 4 cycle computational boxes and see how many multiplies and adds we can cram into each. We know that before we can start adding terms, we have to build them. So in box 1 we will start as many multiplies as we can. These turn out to be just x*x and k1*x. At the end of four cycles we will enter box 2, where we plan to start adding the terms just computed in box 1 in addition to starting several new terms. During the second box we will start adding the terms k1*x and k0 and continue multiplying together x and its low powers to form x2*x, x2*x2 and k2*x2. In the third four cycle box we can start adding the term k2*x2 to the sum computed in box two and also start the last three multiplication's. The fourth through the sixth are dominated by additions which sum up the terms we started in boxes 1 through three. The total number of four cycle boxes required to evaluate the polynomial is six, which adds up to 24 cycles, 16 less than the 40 cycles required by the classic technique. We can evaluate how well we are doing by computing the number of operations per cycle. Ten operations in forty cycles is just .25 operations per cycle, which is as bad as it gets. Ten operations in 24 cycles improves things to .4 cycles per operation. To generate interesting throughput we have to overload our machine using techniques like scheduling and loop unrolling. If we were evaluating two independent polynomials at the same time, the scheduler built into NDP Fortran would be able to overlap these operations, and our efficiency might increase a factor of 2 to .8 operations per cycle. While this is still a factor of 2 below the two operations per cycle the Alpha is capable of, it is a factor of three better than we started with and it might be possible using loop unrolling to find situations where this condition is automatically produced by NDP Fortran and in which a majority of the numeric stages of the Alpha were occupied. This last section was introduced just to make the point that important computations such as the evaluation of polynomials (which get used in computers to produce transcendental functions) can be rewritten to take advantage of pipelining if we ignore our instincts, which dictate that we always reduce the number of operations to improve speed. In the case just cited increasing the number of operations by 40% actually results in a 66% improvement in throughput! 1.2 Optimizing for a gigaflop What makes the use of a compiler tricky in a gigaflop environment is that we no longer are generating code whose intent is to simply get a series of numeric operations performed. Rather, we are searching for a sequence of instructions that will get the job done in as flawless a manner as possible, making it possible for the machine to run with the smallest number of stalls. Often the tools have to examine hundreds of thousands of ways to rewrite a particular line of code as they perform their search. In the case just discussed, even common sense ideas like minimizing the work performed by the machine, no longer can be counted on to produce the best throughput! One of the things that the compiler does when we are scheduling for parallelism at the FPU level is check the rhythm of the code, attempting to find ways to issue four instructions per cycle. If it finds a group of three instructions that "go together" but do not go with the next instruction, it inserts a "no op" into the code - frequently we see scheduled code with a pair of no ops per cycle. The concept that we add placebo instructions to a code stream which do nothing more than take up space in the stream to speed it up runs counter to every concept that we learned in the days that we wrote code for CISC devices. In a CISC machine form follows function. On the Alpha form is king. Speed follows form and function at times can be reduced to what appears to be a side effect. To be able to measure the effect of compiler heuristics on code quality we have to be able to run benchmarks. Even this simple activity gets complicated on an Alpha. We find that there are basic problems just getting old marks to execute without generating exceptions like division by zero. The classic LINPACK 100x100 runs so fast on a 21164 that it will die with timing errors when entire sections execute in a single clock tick, unless special timing routines are used. When performing benchmarks one also encounters numerous situations that suggest that much more is going on beneath the covers than just the code would suggest. After getting basic timing issues taken care of one gets met by the "state of the caches" problem. Depending on the state of the caches when you run a benchmark, you can get times which fluctuate by over a factor of 2! We label such runs as "unstable." With time you get a feeling about what causes unstable benchmarks. Examples include data sets that are on a boundary between two caches or cache and memory. You see these effects for data sets that are near the cache boundaries which are at 8, 96 and 1024/2048 Kbytes.
Figure 1 shows a series of runs made with a "perfect" dot product. The code used to produce this plot is discussed in Appendix A. For this code, the double precision marks were very stable and the single not so stable. The results of the single precision dot product runs are shown in figure 1. The processor was running at 433 MHz with a 1 MB cache. The wedding cake plot is similar to ones that I have made before for the Intel® Pentium® and i860's (see January 1995 Dr. Dobbs article by S. Fried on Intel® Pentium® Scheduling and January 1994 DDJ article on Personal Supercomputers). It is important to use a fast primitive to generate the throughput vs. vector length plot - a primitive which is not capable of getting to within 90% or more of the theoretical best speed of a processor does not tell you that much about how the processor runs with small data sets. This primitive runs at 96% of the speed of a 21164. A similar plot taken for a 500 MHz part would asymptote to 960 megaflops instead of 834. As the sizes of single precision vectors used to generate figure 1 increase the throughput falls due to a reduction in memory bandwidth. However before the fall off starts at the point where the Dcache runs out (vector length = 2,000) we see a steep rise followed by a more gradual one. The steep rise is associated with end effects. For small vector lengths the cost of starting a new loop in a vector machine can be quite expensive. As the size of the loop increases this cost gets washed out and we asymptotes to the speed of the processor, which in this case is 866 megaflops, assuming our code does not exhibit stalls running out of the Dcache. As vector length continues to increase we see the speed falling off as the code starts fetching data out of slower caches or memory. The current Alpha Dcache is not associative, which means fetches from it fall off very fast as the data exceeds its 8 Kbytes size. The Scache is capable of keeping up with the Dcache but the latency jumps from 2 to 10 cycles. The use of our memory delay switch improves the double precision version of this plot but did not have any effect on the single precision runs. Since this plot was made we have seen a similar plot which ran at 750 megaflops out to vector lengths of 10,000. Whether or not this better code was a result of the fact that the problem only needed one accumulator or we got better separation between loads and uses is not clear at this time. There is still a little magic left that we have not applied which might explain this unusual 50% jump in Scache speed. For the run in figure 1 we had a 25% reduction in throughput when we went from the Dcache to the Scache. The fall off to the Bcache is very steep and occurs when the vector length reaches 16,000 reals, which just equals the size of the 96 Kbyte Scache. This is the first point at which we see bistability setting in. The runs at vector lengths of 16,000 were spread over a range of values. As we increased vector lengths again we saw bistability again appear at vector lengths of 92,000 or greater. For these cases the cache operation seems to be pulled between the Scache and slower Bcache throughputs. The lower set of throughput values that were dominated by Bcache throughout (in the vicinity of vector lengths of 128,000) were obtained when we had three CPU tasks running at the same time. However, even with only a single task running, it was possible to get randomly slower results. The speed for this region of vector lengths also appears to be a strong function of the state of the cache when the run starts. There appear to be times when the machine finds itself in a state that results in slow speeds even when it is not heavily loaded by other tasks. The gurus at DEC recommend cleansing the machine ahead of benchmarks by executing a program which fetches an array which is three times the size of your physical memory if you want to get the machine into a well defined state. In general, the bottom set of points for the Bcache region are 50% above what it gets for pure memory, while the good Bcache values run almost a factor of 4 faster. When the system gets stuck in a slow throughput state the processor is oscillating between fetches that are coming from memory and Bcache. This is the region that also benefits from larger Bcache sizes. Form is the critical parameter that determines how fast a 21164 inner loop will run. In some cases, it is possible to figure out what the form of an optimal loop will be and then create a source level primitive that will end up creating the desired form. While this situation requires the user to thoroughly understand the architecture, once learned the technique can be invaluable for evaluating how to write code for any RISC CPU. In the case of the "perfect" dot product plotted in figure 1, we discussed our earlier results with one of DEC's gurus who wrote dot products for a living that were used to drive the emulators they were using to design their next generation product. His words of wisdom were, "I have discovered that it only takes four registers to keep an adder happy (in a dot product) and another four to keep the multiplier happy, this leaves the balance of the registers available to act as a chain which transfers operands from memory to the register file." One of the goals when we write very fast code is to attempt to make the Alpha load chain as long as possible. This reduces the effective latency of the unit we are sucking data out of. For us, his observation turned out to be very insightful, but did not quite jibe with a series of benchmarks we had run on a group of dot product sources we had unrolled at the source level in Fortran before using the unroller built into our scheduler to expand the inner loops. After considering his words of wisdom, we realized that if he was correct, the best Fortran loop to start with would be the one unrolled by four which used four accumulators. This would give us the first criterion he was looking for, four registers (the four accumulators) devoted to the adder. The multiplier was taken care of by the compiler, leaving anything left over to loads. To get the code to peak, we tried running it with different amounts of compiler generated unrolling. In the past we had quit at 3 (i.e., at 12 dot products in the inner loop), but had gotten quite good results at 2. We now explored what would happen if we went to larger levels of unrolling. He was right. At an unrolling level of 4 we got better results than our best run to date which had been made with a source loop unrolled to 8 that had been unrolled by 2 (i.e., both loops did 16 dot products, but the one that used only 4 accumulators ran faster than the one with 8). When we increased the unrolling on our four accumulator code to 5, we got still better results. At 6, the benchmark started to fall off because we were running out of registers. 1.3 Driving a "five pass" compiler In the early 80's compilers were touted as being single pass, with the implication that there was some intrinsic benefit to being able to do everything associated with a compilation in a "single pass." There are even texts that describe parsing techniques which are purported to improve the performance of such compilers by combining semantic analysis with parsing. It is easy to show that to do a minimal amount of global optimization it is necessary to at least build a data flow graph before optimization can commence and that this process requires at least two passes. The bottom line is that the smallest number of passes that make sense for anything more sophisticated than an 8086 is two. Most modern compilers recurs the program being compiled numerous times in the process of performing a compilation. Often it becomes necessary to measure things like variable lifetimes to allocate registers or search for blocks which are never reached to eliminate dead code. In the end dozens of passes get made over every block of code. The programs which carry out these tasks are often invoked under the control of a program called a compiler driver, which not only runs the compiler itself, but any ancillary applications like the system assembler and linker. All NDP Alpha compilers are composed of five stages which are implemented as parts of three executables. The user need not worry about the details of how the executables get invoked by the driver, but we will discuss their contents here for completeness. The first executable performs the lexical scanning, parsing and intermediate code generation. The first two stages are different for each of the NDP languages and are usually referred to as "the front end." The same front end gets used by all NDP Fortrans independent of which processor the code is generated for. The other two front ends convert C|C++ and Pascal into intermediate code. The source for the front ends, while common, may contain special sections that pertain to a specific processor or operating system. The third stage acts as an intermediate tree constructor and optimizer and performs many of the classic CISC optimizations that are used by our x86 family and have been in use at Microway since the early 80's. A single intermediate stage gets used by all NDP compilers running on all processors. The first three stages are written in NDP Pascal, which must get ported to a new environment to run compilers in native mode. The output of the third stage is a block oriented abstract syntax tree which gets read by the fourth stage, ABE, the Microway "Alpha Back End." In addition to converting the parse tree into Alpha code, ABE allocates registers and performs a number of simple optimizations. ABE then outputs Alpha assembly language which can be executed without further optimization if performance and code size are not important. The third executable is GVS, Microway's Global Vectorizing Scheduler. This is a post processing pass which derives from a tool we developed for the i860 - PPS-860. In the I860, it rearranged numeric operations so that adds and multiplies could be pipelined. In addition to carrying out this same transform for the Alpha, it also performs scheduling and important RISC optimizations that include loop unrolling, vectorization and memory disambiguation. To accomplish these tasks it needs to build a new data flow graph and then uses this graph to solve a number of memory access problems which include the addressing of array elements. The graph also gets used to eliminate many redundant computations that are related to addressing and do not get discovered by the third stage. 1.3.1 -O..-O5 Compiler Driver Global Optimization Switches The classic CISC optimizations form the backbone of all programs produced by the NDP Alpha compiler. To do a good job you need to start with a robust foundation. The CISC foundation provides the functionality on which the computation is built. What makes it fly often is the way we decorate it with advanced optimizations which end up filling it out to eliminate stalls. The compiler driver uses a sequence of -O switches to control the overall program optimization process. These new global optimization switches actually are wired to all three sections of the compiler and send out lower level optimization switches to the front end, ABE and GVS under the covers. The details of this process are shown in Table 1. 1.3.2 -O, -OLM, -OLMA and -onrc Compiler front end switches From this point on we will talk about the first three stages of the compiler and the switches which control them as if they were the "front end." Many of the optimizations that we used to do in the compiler intermediate section, which is now part of the front end, are now done in GVS. Examples include loop unrolling, which is now turned off in the intermediate optimizer. In most situations a small set of front end optimization switches will do a good a job controlling the how the front end optimizes its parse tree. These switches get passed to it by the compiler driver using two different techniques. In the first they are hard wired into the Global switches -O ..-O5. These switches are discussed in detail below and can be examined in Table 1. The second technique for controlling the front end uses "pass through" commands which make it possible to precisely control the front end and GVS independent of the global "-OX" switches. For example, when you chose the -O3 optimization level for your entire program, the compiler passes the switches -OLM and -onrc through to the front end (see Table 1) as part of a command file into which it places all the switches needed to drive the front end. If you wanted to use the features of the -O3 switch, but did not want the "M" component passed through to the compiler, you could add the following front end pass through to the command line: -Wc,-OL,-onrc The driver will then replace the hardwired global commands for the front end that it uses when -O3 is selected with -OL -onrc. The most common "front end" switches used are -O, -OLM, -OLMA and -onrc. See Table 1 below for details of how they get connected to the driver's global switches -O..-O5. The front end switch -O instructs the front end to perform basic optimizations intended to improve code by reducing its size. When we add the L and M suffixes to -O we instruct the front end to do optimizations which speed up execution speed by increasing code size. This often involves analyzing loops and rewriting them so that they make fewer memory references by caching important variables in registers. It also eliminates index variables wherever possible replacing them with induction variables. We refer to them as loop and memory optimizations - which is where the L and M come from. In addition to making the programs difficult to follow with a debugger, they also must be carefully employed if you are writing device drivers or OS kernels. The -onrc is one of several switches we used to use to turn on CISC compiler features that were not always used. It controls register caching, an optimization that could be done by GVS but which we still do in the front end. Register caching is described in great detail below and is required for vector expressions which have not been written with dummy variables. Another important optimization component for the front end is "A" as in -OLMA. This introduces "algorithmic" transformations like the "Whetstone" transform, which reduces expressions like ln(exp(x)). Algorithmic optimizations to not asymptote to correct values when the functions they are composed of approach infinities. If you have an expression for which evaluation order is crucial, then you should avoid the A component of -OLMA and also use the GVS switch -ea to force exact evaluations. Again, the switches passed to the front end, like -OLMA, are emitted by the compiler driver switches -O .. -O5 or the compiler pass through switch -Wc,.. 1.3.3 -O, -RA ABE's switches The switches passed to ABE are always -O and -RA. These tell it to allocate registers for procedures and perform several low level optimizations. Unless you want very raw code as output, ABE should be used with these switches. They are passed to ABE by the compiler driver automatically. The only occasions when it pays to use ABE without register allocation or basic optimizations is when you need to debug code produced by it. 1.3.4 GVS Overview In a typical compilation 60% of the compilation time will be consumed by GVS, our global vectorizing scheduler. In situations where critical routines are being heavily scheduled, this can balloon to 99% of the time. For the routine shown in figure 1, the best code scheduled in 15 seconds, with the balance of the process, compiling, assembling and linking, taking less than 1 second. However, it is possible to spend much more than 15 seconds scheduling a dot product. In situations where the size of an inner loop is very large or the code is very awkward, the scheduling process can take minutes. In fact, one of the ways you can tell that you are doing a good job producing a code skeleton that GVS will be adding flesh to is time - well formed skeletons schedule faster than those which are awkward. There is nothing in the scheduler that is designed to tell it that it has been given a hopeless case to solve. The ability to schedule an algorithm using a machine should still be considered a gift, even if it takes a few minutes to do at times. In the case of the I860, many of our customers spent weeks writing hand coded numeric kernels - the work laying out and optimizing these routines is now done routinely by GVS. One customer went so far as solving for the best set of instructions for his problem by up setting up 30 equations and 30 unknowns to solve for the best way to schedule a 30 line sequence if I860 instructions he used to do neural net training. This resulted in a single unique set of instructions which while running 20% faster than his original, took him close to 2 months to obtain. At the time we considered the fact that there existed a unique sequence as remarkable, along with his method. For the Alpha the existence of such a situation is remote because of the nature of the instruction set. The real problem with scheduling that frequently arises is that code which should not be heavily scheduled ends up getting the full treatment and the machine bogs down in one endless optimization session. What this means is that if your plan to produce excellent code, divide your program into a small group of critical kernels and compile and schedule them separately from the balance of your program. 2.0 The NDP Compiler Drivers mf, mp and mx Compiler drivers are programs which manage the operation of compilers and their associated tools. The NDP compiler drivers are C applications which originally derived from the UNIX cc command. Depending on the processor and operating system, the NDP compiler drivers can have different names. For example the driver mf860N was used to compile Fortran sources for i860 add in cards that ran Native on the cards. We simplified our naming conventions recently - all of our Alpha drivers, independent of whether they run cross or native or on DEC UNIX, NT or Linux use the three simple names above. Driver functionality has remained fairly constant over the years. When you type mf, mx or mp, the compiler driver gets activated. It analyzes the switches on the command line which follows it along with the file names on the command line and uses this information to produce a program. By default, if you do not specify an output file name it produces the UNIX default output, a.out. To run the program produced you type its name from the console. The compiler driver is a data base application of sorts. It builds tables internally that tell it how you want your program compiled. Once it has completed these tables it creates command files which it uses to drive each of the stages. In the first stage it compiles to assembly language all the files which have extensions ending in .f, .p, .c or .cxx. In the case of the Alpha this is done by passing commands to the compiler front end, followed by ABE and culminating with commands to GVS. Most of this chapter revolves around how the compiler driver gets used to control the compiler front end and GVS. It then assembles any .s files found on the command line along with the .s file produced by the compilation phase. Next it adds all the object and library files found on the command line along with any produced by the assembly phase to a link command file which it then passes to the linker to build the application. One of the benefits of using a compiler driver is it eliminates having to learn how to use the assembler and linkers as you move from one operating system to another. The formal operation of the compiler driver follows that used by UNIX compiler drivers cc and f77. The current NDP Alpha document does not accurately describe all the driver switches, however each driver has a help file which it will direct to the screen if you type the compiler driver without using any commands. The current driver is virtually identical for all three languages. It assumes that files with .f, .c, .cxx, .p, .o, .a and .s extensions are used to represent Fortran, C, C|C++, Pascal, object, library and assembly language files. The main difference between the three versions of the driver is related to figuring out the "primary" language of the program being built. This affects the way the linking is done. In a UNIX environment the tradition is to take the extension of the first source file on the command line as a hint indicating what kind of program you are building. We assume that the output will be a C, Fortran or Pascal application based on the type of the first file. What this means is that if your first source file is a .s file, the driver will not know what runtime library needs to be linked with it. It will then ask for linkage information, which can be provided with driver switches like -flink, -clink and -plink. NDP executables are built using techniques inherited from UNIX. The routines in the NDP C library are so close to those used by all BSD like UNIX's, that their names often conflict with those used by the host's library. In addition, we use the same under score protocols for naming variables and routines that were used by the BSD compilers. In situations where we rely on the host C library for making contact with the operating system we add a layer of procedures that disambiguates the names. The NDP runtimes are all written in C, and the Fortran and Pascal runtimes end up calling the standard UNIX libraries libc, libm, libp and libf. In some Microway compilers, you will not find either a libm or libc, but these libraries will still be present as part of a larger composite library. 2.1 Quick tutorial on compiler driver usage For those users who have not used a compiler driver before, it is recommended that you spend sometime looking at a UNIX book on cc or f77 or reading the NDP manual. The compiler driver can save you a lot of time and eliminate your having to learn numerous switches, that are often only understood by insiders and gurus. For those of you who do not have a UNIX tutorial available we will now demonstrate how they can simplify life. Consider the following problem - you have just created an assembly file, my.s which you want to link with a Fortran object module, module.o to build a program, prog.out. You don't have to learn how to use the linker and or assemblers in the NDP Alpha NT, UNIX or Linux environment to run the system assembler or linker: the following command will build this application in all three environments using the driver mf to send the appropriate commands to the assembler and linker in each of these operating systems. mf my.s module.o -o prog.out -flink Again, because the first module is not a .f file, the compiler driver mf does not know if it should link against the Fortran runtime or not. It gets this added piece of information from the -flink switch. To build this application it first assembles my.s into an object module, my.o, which it then links with my_module.o, libf, libc and libm. Suppose my.s was derived from a C application, how might it have been built? If there was no difference between the assembly sources created by the C compiler and those used above, we could have built the entire application by using the same command line and replacing my.s with my.c. Or, if the reason we were working with the .s file in the first place is because we wanted to improve the code at the assembly level, we could have created the initial assembly output from my.c by invoking the Fortran compiler driver with the first file being a C source and using the -S switch to stop the process after the assembly files were produced. The following line accomplishes this task: mf my.c -S Since the Fortran compiler driver is almost identical to the C and Pascal drivers, it has no problem recognizing what needs to get done here, assuming you have both the NDP C and Fortran compilers. Often there is a need to produce object modules, especially when we are doing selective compilations and do not want to waste time recompiling modules over and over that never change or whose execution time is not critical. To produce a file like my_module.o from the source file my_module.f, we can stop the compilation after the assembly process and before linking, using the -c switch. If you are interested in watching what goes on "under the covers" you can use the -v verbose switch to get the compiler driver to generate a running commentary on its actions. All of the switches mentioned so far are standard on any UNIX compiler, which means once you are familiar with them you can also use DEC UNIX compilers. 2.2 -OX Global Optimizations Switches During the compilation process the compiler driver must send commands to the three executables it uses to translate a .f or .c file into a .s file. These three executables reside in various NDP directories and go by the names mfal, mxal, mpal, ABE and GVS. The first three correspond to the first 60% of our original NDP compiler family. They produce the intermediate files read in by ABE, which emits Alpha assembly language that gets globally vectorized and scheduled by GVS. The compiler driver must pass commands to each of these facilities. Amongst these commands are ones which describe how the user wants his program optimized. Since this process takes place in all three regions, this means that the basic optimizations commands used on the driver command line need to get broken up into three commands. The pre-chosen "O" global optimization switches which accomplish this task are explicitly shown in Table 1 below. The trend in recent years has been to create hierarchical optimization switches - O, -O1 .. -O5, etc. We have duplicated this philosophy here, by pushing our CISC switches -O, -OL, -OLM and -OLMA switches "under the covers" so that they only get used to drive the NDP Alpha front ends like mfal. The problem with using just six hierarchical switches -O .. -O5 is that they do not provide enough control to make it possible to optimize all the different kinds of procedures and data sets that can be encountered. After one becomes familiar with the advanced GVS switches, one concludes that there are four important switches that each control an essential feature of GVS. For each of these switch settings GVS behaves in a different enough manner to make it necessary to be able to control each if really good results are desired. In addition, several of these switches have integer values associated with them. If each of the four behaved in a binary manner (i.e., could be on or off) we would need 16 different -O combinations to control GVS. In the real world five switches do not provide this level of control and another mechanism needs to get used to fine tune critical routines and loops. Table 1 lists the "-O" switch combinations that we are currently shipping with the product. The purpose of these switch combinations is to act as a starting point for a more detailed compilation of a program. When a combination has been found that provides reasonable results, it is still possible to make significant improvements by fine tuning the compilation using the pass through switches -Wg or -Wc to precisely send switches to either the front end or GVS. In addition to the pass through switches, there exist other techniques like command files for GVS which make it possible to control how each routine gets scheduled and auxiliary switches which can be added to the driver command line to fine tune the -O switches. Finally, it is even possible to place metacommands in NDP source files which pass control messages in comments through to GVS. Each of these techniques is discussed below, but for now we will assume that you plan to use the -O global switches as intended and use the pass through or auxiliary switches to control GVS. The parameters passed through to the three stages can have a very large impact on performance. To begin with, for an Alpha, the difference between the worst scalar bound code and the best vector code is a factor of 8 in throughput! The dot product kernel shown in figure 1 started life running at 78 megaflops when compiled with -O. It reached a high of 478 megaflops using -O3 before we started the fine tuning process. The very best throughput turned out to be 834 megaflops, which is over a factor of 10 better than the result we got just using the -O switch! To find the best value, we not only experimented with compiler optimizations, but the form of the source. We ran 10 different source versions of a dot product kernel with six different switch settings. The kernel was one we had extracted from a matrix multiply that had been blocked by KAP. The work was partly done to understand the effect the number of accumulators had on Alpha dot products. The switches and code which gave the best results had an inner loop with 4 dot product pairs and 4 accumulators. The code was unrolled by 5 and then vectorized by GVS, which means the code which gave us the best results actually had twenty dot product pairs in its inner loop. mf or mx | abe | gvs | front end | comments switches switches | switches | switches | --------------------------------------------- -O | -O -RA | -lb -so |-O -X370 |fast,not very good -O1 | -O -RA | -fo -so |-OLMA if mf |add algorithmics for Fortran -O1 | -O -RA | -fo -so |-OLM if mx |but not for C -O2 | -O -RA | -fo -fs |-OLM -onrc |fast scheduling -O3 | -O -RA | -fo -lu 2 -bf 50 |-OLMA |full scheduling plus unrolling -O4 | -O -RA | -fo -lu 4 -bf 75 -md 7 |-OLM -onrc |more unrolling plus mem delay -O5 | -O -RA | -fo -lu 9 -bf 100 -md 7 -ms 1 |-OLM -onrc |disambiguation and delay Table 1 Commands emitted by driver to ABE, GVS and NDP front end by "global" -OX switch settings 2.2.1 Commentary on Table 1 The global optimization switches listed in Table 1 were created in order of compilation speed. Those nearest the top compile the fastest, but produce the least improvement. The switches toward the bottom take much more time but produce much better code. However, at best these switch settings should only be considered a starting point for the optimization process. 2.2.1.1 Basic optimization settings, -O The switch setting, -O, is provided partly for safety purposes. The front end, while very mature, does optimizations that are not appropriate for all uses. While the situations are rare where the most powerful front switches used (-OLM and -onrc) can not be used, they do occur. In addition, there are situations in which clean literal translations are desired for debugging purposes. The sorts of programs that one must worry about -OLM -onrc failing are not likely going to be encountered by "lay" users, as lay users rarely write programs which directly control hardware. The M component of -OLM fails when it gets used to optimize a loop used to poll memory locations whose values can change asynchronously, i.e. are not really memory at all, but memory mapped devices. In addition, situations arise in C with aliased arrays that produce bad code when -onrc is used. Each of the first three global -O switches employ the GVS switch -so (statics only). What the -so switch really does is stop before scheduling - none of the first three O switches results in scheduled code. The -so switch tells GVS to simplify the code and quit before scheduling. The preliminary optimizations performed include 30 static memory transforms along with a long list of strength reductions. These transformations remove the typical redundant temporary variables that are created by the front end compilation process. The first switch also uses a feature which dramatically speeds up GVS by reducing the size of code blocks. The -X370 switch is one of the 1200 internal X switches that get used to control the first three compiler stages. About six X switches are emitted by the compiler driver and then can be viewed if you use the -v verbose switch. The balance are turned on or off by internal initialization routines. The X370 switch tells the front end to divide the output on a line by line basis and put the line number information in its output. GVS uses this line of information to limit optimization to one source line at a time. This results in GVS running dramatically faster. A -O switch is also passed to both ABE and the front end. This -O is not to be confused to the global switch we use on the command line. The -O's that go to ABE and the front end tell ABE to do a number of peep hole like reductions and the front end to do "basic" optimizations which reduce code size. This combination compiles quick and is reasonable for sections of code which rarely execute and do not contain critical loops. They also save hard disk space and load times. The -O1 switch adds -fo to -so. The combination of -so (statics only) and - fo (full optimizations) provides the maximum level of scalar optimization up to the point where scheduling starts. It is often true that scalar codes do not benefit from scheduling, although this can not be counted on if the code has been correctly written. This combination results in very readable assembly code being produced. The -fo (full optimizations) tells each of the heuristics stage to go to completion. When -so is used by itself many of the optimizations in GVS have "breakers" in them which tell the heuristic to run once and not until no further improvements could be made. If you don't specify -fo, you don't get as thorough job. The -so -fo switches produce relatively good code that would have been acceptable in the mid 80's. However without access to loop unrolling or scheduling the best this switch can do is enhance the performance of scalar codes. The front end switches used with -O1 are either -OLM (for C) or -OLMA for Fortran. These switches are designed to help all codes which have loops in them, but quit before trying anything exotic. The fact that we have used -OLMA for Fortran instead of the -OLM switch used by C is an attempt at improving Fortran numeric algorithms which call Fortran intrinsics and contain expressions which can be reduced using identities that might be considered controversial in some instances, like the x = sin(asin(x)). Because we have not turned on register caching, we do not expect this combination to do a descent job with vectors, even if they have been unrolled at the source level. -O1 is a better switch than -O for high level routines which get called several times by a program, i.e., the 99% of a program's source which only does 1% of the processing. 2.2.1.2 Basic scheduling setting -O2 The -O2 setting is the first in which we do not use -so and which invokes the GVS scheduler. It has been introduced as a "basic scheduling" switch and it is appropriate for programs which contain vector expressions and do not need compiler generated unrolling. If you are compiling vector codes which contain loops unrolled at the source level, this switch should provide reasonable performance and will not add much to the compile times of other routines. This setting will also do a better job than -O1 on general purpose routines, without getting mired down in long winded scheduling exercises. To continue to keep compilation times down, we have used -fs (fast scheduling) which only applies the first level of the GVS scheduling heuristic. There are times when -fs actually produces better code than the default scheduler setting. The other switch that gets added at the -O2 level is -onrc, which detects vector variables and caches them in registers when it can. This is a very important optimization for codes like dot products, when they are not written with explicit dummy variables. Register caching is a heuristic that determines that a vector variable in a loop can be considered to be a scalar for the duration of the loop, and therefore placed in a register. When this doesn't happen, i.e., when the caching doesn't pick up on this fact which can happen when it detects a side effect it doesn't like (like the variable being used in an EQUIVALENCE statement in Fortran), the user can force the system to do his biding by writing code which unambiguously states that vector elements in loops are to be cached. For example, the following lines which lie at the heart of a matrix multiply, DO 130 K = 1, 100 130 C(I,J) = C(I,J) + A(I,K) * B(K,J) will get picked up by -onrc and get effectively converted into the following code: DUM = C(I,J)
DO 130 K = 1, 100
130 DUM = DUM + A(I,K) * B(K,J)
C(I,J) = DUM
However, if the user were not sure that the conversion was going to take place for some reason, he could rewrite his code using dummy variables as we did here. In earlier NDP compilers -onrc was recognized as a command line switch along with switches like -OLM. Since there are occasions when we might want to turn it on without using the -Wc compiler pass through, we have added it back into the switches recognized by the driver. 2.2.1.2 Loop Body Expanders, -O3..-O5 The next three switches, -O3 .. -O5, emit GVS switches which expand the size of inner loops. These switches only pay off if the loops being expanded are located in critical routines. If the routines expanded by these switches are critical to your application and they can be unrolled and scheduled productively, the improvement in performance they result in can be truly remarkable. If the routines are not critical or can not be productively expanded, your code may blow up, slow up and your compilations take forever. The three GVS switches which control loop body expansion essentially add loop unrolling in three stages along with memory delays and memory disambiguation. At one point we also mixed in vectorization. However, vectorization only helps out in very specific instances and when combined with memory disambiguation can result in horrible compile times if indiscriminately applied. The "intended uses" for the three "advanced" switches are: -O3 More thorough scheduling than -O2 along with a small amount of loop unrolling. The loop unrolling uses a -bf (be frugal) setting of 50 cycles, which will cut loop unrolling off when it gets applied to loops which have been unrolled at the source level. Unrolling by 2 provides only a taste of what the real McCoy does, but it also keeps compile times down and should even be useful for some long winded scalar loops. -O4 increases the unrolling from 2 to 4 and sets the be frugal bar up to 75 cycles. An unrolling factor of 4 often gives good results on the Alpha, but can be a factor of 2 to 4 below the best value if they code being unrolled has not been unrolled at the source level. This switch adds 7 cycles of memory delay, which makes it more appropriate for arrays being fetched from the Bcache or memory.
-O5 increases the unrolling factor to as large a value as we should ever consider blindly using on a program. It uses be frugal switch set to 100 cycles and adds 1 level of memory disambiguation and 7 cycles of memory delay. This setting does a reasonable job with the LINPACK kernel DAXPY. 2.2.1.2 Global Optimization Switch Helpers: -onrc, -onA, -onmd, -onea, -onms, -onv, -onsa The best laid plans of mice and men never seem to quite work out the way they were intended to. To make it possible to spice up the global optimization switches without having to use the -Wg or -Wc pass through features, we added back a couple of old NDP "on" switches. These effectively embellish or modify the information we pass to the front end or GVS. Front end embellishments The two front end optimizations it would be nice to add to the front end is register caching and algorithmics. So we recreated the switch -onrc and added on a new switch -onA. These are each compiler driver command line switches. The -onrc results in the compiler driver adding -onrc to the command file which gets passed to the compiler front end. The -onA switch adds an A to any -O switch passed to the front end. For example, if you were using the global switch -O4, which passes -OLM to the front end but wanted -OLMA you would use the -onA switch in addition to add an "A.". GVS embellishments There are many times when we want to do more than unroll our code, which is essentially what switches -O3..-O5 provide. The following compiler driver switches cause GVS to get the following GVS features which can be thought of as "spices." Driver GVS comment ---------------------- -onmd -md 7 memory delay of 7 -onea -ea do exact arithmetic -onms -ms 1 mem disambiguation -onv -v vectorization -onsa -sa force split accumulator 2.2.1.3 -Wc and -Wg Pass through switches It is possible to partially or fully bypass the -OX switches using the -Wc and -Wg pass through switches. If you use the -Wc switch, the driver will replace the front end commands it uses in Table 1 with the commands used in -Wc. Similarly, -Wg over rides the -GVS commands listed in Table 1. For example, to compile my.f you could use the following custom set of switches: mf my.f -Wg,-md,7,-ms,1,-v -Wg,-OLM,onrc When you use pass throughs you have to remember that the first space in each terminates the command, which is why we delimit commands with comas. Here -md,7 is the equivalent of the GVS command -md 7. 2.2.1.4 Hints on using Global Optimization As we said before, there is no correct set of switches, and for really good work you have to use either the pass through switches described in 2.2.1.3 or GVS command files described in 2.5.2. However, there is a fairly fast way to evaluate one kernel at a time to figure what it's nearly optimal settings are. The first step should be to run -O3 and -O3 with -onv. Neither of these scheduling sequences should take very long, as the level of unrolling used in -O3 is small. However, it is large enough to tell you immediately if vectorization is going to pay off, as vectorization shows its largest percentage increases in performance at small levels of unrolling. If you optimize more than one at a time it is possible you will not see a benefit from vectorization, as we have seen instances in which it makes things worse. Since vectorization can be very time consuming, and loop unrolling increases scheduling as the square of lu, it is a good idea to figure out if vectorization is going to be required using -O3. Having determined whether or not vectorization pays, we next examine memory delay using a slightly higher level of unrolling -O4. If vectorization paid with -O3, use -onv with -O4. If this turns out to give much better results, it tells you that either increased unrolling or the use of memory delays is important for your kernel or data set size. Next you should try -O5, which adds memory disambiguation. If you have been compiling a DAXPY like kernel, you should see a big kick with this setting. It is very possible that -onv will have no effect on your code, as it doesn't on DAXPY when called by LINPACK. By the time you start running both memory disambiguation and vectorization together, you will have gotten to the point where GVS will take quite some time for even small kernels. For each possible memory configuration scheduled by disambiguation, the vectorizer will have to try many time consuming versions of your code which grow in time as lu squared. 2.3 GVS switches When you type gvs -h at the command line, the following text appears on your screen - it summarizes the GVS command line switches. gvs V1.16 Copyright (C) 1992-1996 DaLSoft, Inc. All rights reserved.
Usage:
gvs -i fn [-o fn] [-gs #] [-bs #] [-os #] [-ls #] [-ns] [-na] [-nc]
[-nm] [-ms #] [-so] [-nd] [-ea] [-v] [-lu #]
[-fo] [-slct] [-q] [-lb] [-fs] [-sa] [-nsa] [-bf #] [-target #] [-h]
gvs OPTIONS
-bs # - block size -os # - overlap size
-ns - no static -so - static only
-nc - no register allocation -na - no art substitution
-nm - no memory optimizations -ms # - memory optimization depth
-nd - no dualization -ea - only exact substitutions
-lu # - loop unrolling -v - vectorization
-fo - full optimization -q - quick optimization
-gs # - global block size -slct - select optimization
-lb - keep source line boundaries -bf # - be frugal
-ls # - linear scheduling size -fs - fast scheduling
-sa - split accumulators if '-ea' -nsa - no split accumulators
-add31 - addresses up to 31 bits -target {21064|21164} - CPU to optimize for
DEFAULTS: -bs 1 -os 0 -ls 1 -ms 1 -gs 100 -lu 1 -target 21164
We have moved many of the most important optimizations from the NDP front end into GVS, which also consumes the bulk of the time spent performing optimizations due to the cost of scheduling. For most compilations, if you use -OLM and -onrc with the front end (see the last section) you will pass code through to GVS that is ready to be thoroughly optimized and scheduled. In addition, GVS gets activated with a number of different switch settings when you use the global optimization switches -O .._O5, which are also discussed in the last section. However, to do a really excellent job with your code you will have to break the primitive routines out of your code and optimize them one at a time, as the optimizations performed by GVS are sensitive to issues like the size of data sets and the size of the critical loops being unrolled. 2.3.1 Introduction to GVS Switches We will now explain what each of the GVS switch settings control. To help rationalize this process, we have divided the explanation into three sections, preliminary optimizations, scheduling and loop body expanders. If you were using GVS as a stand alone utility, you would pass it the switches described above in the GVS command line. However in normal operations GVS receives its switches from either the driver command line (see section 2.2) using global optimization switches or pass through switches or from a command file or the source itself as metacommands. The last two methods are described at the end of this section. 2.3.2 Preliminary static optimizations The raw output of typical compilers contain many extra variables and expressions which get generated as a result of the fact that intermediate forms break the problem into tiny pieces, each of which usually generates one or more temporary variables. Fortran complicates this issue by associating each variable with an address in addition to a value - raw Fortran code requires each variable to be dereferrenced before it can be used. One of the jobs of good compilers is to get as many variables stored in registers as possible. Part of this process requires the program to be simplified, which means that the extra temporaries generated by the compiler have to get reduced in number. These include addresses which get generated for each usage of a variable but never change. Caching frequently used addresses into registers, such as the addresses of arrays, variables and structures, is one of the important static transformations done at the beginning of a GVS optimization. These reductions are referred to as static because of the fact that they can be performed at compile time. GVS also does dynamic memory optimizations, which require that it generate code which evaluate how arrays overlap and generate alternate loops which get activated at runtime in a dynamic manner. GVS also performs a number of well known reductions that transform long running computations into faster ones. These are often referred to as strength reductions (i.e., x**2 becomes x*x) or "art substitutions." One of the things this section does is rearrange expressions so that the number of temporaries needed to perform an evaluation is reduced. These transformations do not always maintain the order of a computation. If you use the exact evaluation (-ea) switch the order of the calculation does not get changed, but the speed of your code may suffer. 2.3.2.1 statics only: -so,-ns and -so,-fo When you do not want any scheduling to take place, you use the -so (statics only) command. What you are left with by eliminating "scheduling" is the preliminary static optimizations discussed in the last section. These can be turned off with no statics -ns. However if you request statics only you will also get art substitutions, which can be turned off with -na. The word statics here is a bit of a misnomer. It was used originally to indicate that dynamic memory transforms would not get done. For many of the primitive static optimizations there are simple and advanced versions of each heuristic. The advanced versions do more of the same, but take longer. When you pair -so with -fo you let each of these heuristics run its full course, but still quit when you get to the scheduling stage. The net effect is you get a more thorough clean up pass. The reality is that since you have already run the NDP front end, which includes a rather full featured optimization section, what comes out can be rather good scalar code, depending on which NDP front end switch you chose. GVS performs the -so -fo pass very quickly, and it should be the minimum clean up you consider using if you want to understand the code output from ABE. This code could even be "too good" for symbolic debugging, depending on which front end switch you used with the front end. 2.3.3. Switches that affect scheduling: -fo, -fs, -ls, -bs, -os, -gs, -lb From a formal perspective, scheduling can be thought of as a sorting process, in which a list of instructions gets reordered in such a manner that the execution speed of the resulting program is improved. For a program to increase in speed by the simple act of swapping instructions, there must be something else going on that is having an effect on speed. The "thing" that is not going on until scheduling takes place is the pipelined execution of instructions. Scheduling is a non-intuitive process that can result in an order of magnitude increase in speed if done properly. In the case of a 21164, the scheduling process in conjunction with transformations that make it possible for scheduling to improve executions speed, can convert a program which is executing one numeric operation every four cycles into one which does a pair of operations per cycle. The formal process of scheduling involves examining the phase space of legal programs to find the one which minimizes the number of processor stalls. The practical technique uses heuristics which first convert your program into one for which scheduling will produce meaningful results (i.e., loop body expansion), followed by a preliminary scheduling pass which organizes the code based on a future resource needs basis followed by a final pass which tries to improve things by searching for the best possible program near that produced by the first scheduling heuristic. The first scheduling heuristic used prioritizes the instructions laid down based on the number of stalls each will eliminate in the future. It plays a very important role in the scheduling process that corresponds to laying down a foundation for a building. Scheduling is very much like "routing," the process used to place traces on a printed circuit card or inside a chip. The first phase of routing is called "lay out," If it is not done correctly not only will it be impossible to complete the circuit, but it might not even be possible to build a circuit that works, as the distance between devices determines how many picoseconds it takes signals to propagate between devices. The most common heuristics used after lay out are referred to as "ripup and retry" or "push and shove" - names which describe what these passes do. The clean up scheduling passes that we follow our first scheduling pass with easily have been given these same names. 2.3.3.1 Preliminary and linear scheduling: -ls # The goal of scheduling is to search the space of legal representations of a block of code, to find the permuted block which executes most quickly. The formal process used by GVS starts with a classic scheduling heuristic which results in code that minimizes downstream resource conflicts. The preliminary heuristic is not guaranteed to work, in fact none of the techniques used to schedule code are guaranteed to work, but they do, most of the time. To be able to predict stalls GVS must contain a set of rules which describe when they occur. For the 21164 it is known that fetches from the floating point register file take 2 cycles, fetches from the Dcache take three cycles, from the Scache take 10 cycles, that adds and multiplies each take four cycles to complete but can be started on every cycle, etc. etc. There are dozens of such rules that need to get applied during scheduling. The preliminary scheduling heuristic gets applied to all the lines of code in a block. Next, a second order smoothing heuristic is optionally applied. By default, the preliminary scheduler is run anytime -so is not used. Looking above we discover that three other scheduling switches also get used by default: -ls, -bs and -os. The first of these, -ls, controls a heuristic we call linear scheduling, which is the default smoothing pass that follows preliminary scheduling. Anytime -so is not used, you get a combination of the preliminary heuristic and linear scheduling (unless -fs is used). Linear scheduling can be thought of as a second order technique designed to make small fixes in a block. It is similar to the routing technique rip up and retry, although we have also used it as a primary technique in the past. When used as the primary scheduler it worked fine up to loops unrolled by 8. At this point we started to hit what are best described as scalar speed bumps. In linear scheduling every instruction in a scheduled block that still has a stall (some stalls can not be corrected) gets examined. The block is searched for instructions which when swapped with the stalled instruction, will eliminate the stall. This often creates new stalls. The process gets repeated for the newly stalled instruction. If you use -ls 4, the process will get repeated 4 times for every stalled instruction. This routine takes time that is proportional to the number of swaps we are willing to use to use to reduce the number of stalls, which is where it gets the name "linear." 2.3.3.2 Combinatorial scheduling: -bs # and -os # The formal scheduling problem is one for which it can not be proven whether maxima exist or not and whether there is any best technique. The early versions of GVS used a costly combinatorial method followed by linear scheduling. The combinatorial method is controlled by the bs,# and os,# switches. This technique attempts to do a fine grained search of all the legal ways to execute a block most quickly using a search radius about each instruction. The first parameter sets the size of a scheduling window, which essentially states how far in advance of the current point we will consider swapping new instructions into the scheduling window. The second parameter sets the size of a window which states how many of the instructions that we have already placed we will allow to be ripped up in the course of examining new sequences. The current default values of -bs 1 and -os 0 essentially turn off combinatorial scheduling. Combinatorial scheduling behaves like a zone refiner. The window moves along the list accumulating instructions as it goes, and putting them back down when it can. When combinatorial scheduling was our primary technique, the typical values which resulted in good results were -bs,4 and -os,3. The technique is called combinatorial because at each point we have to examine N! possible ways to lay the next instruction down, where N = 7 (i.e., about 6,000 possible cases). As it turns out, it doesn't take all that much time to examine 6,000 cases on an Alpha. However, when you increase bs to 5 and set os to 3 the number of cases balloons up to 50,000. Even if you only have to perform this task for a few hundred or a few thousand lines of assembly language, things aren't all that bad, but when you add the other things that come out of the loop body expansion phase, the costs can grow quickly. Our experience with this algorithm on the 21064 was that it could take 300 seconds to find an 80% solution for a kernel like DAXPY or DDOT, an hour to find a 95% solution and 24 hours to find a 98% solution. We found that it worked best when it was followed by linear scheduling with ls set to 4. We also discovered by accident that linear scheduling itself worked pretty well when we used it with a very large parameter like 999, which makes it possible to keep on making swaps until there are none left to try. The only reason we have left combinatorial scheduling in GVS is that we think there are some large scalar problems which it might be better at than the current combination of heuristics, which is designed to solve problems which contain vectors. In addition, it is the kind of brute force technique which can find solutions when all else fails, simply because it can be set up to make as fine grained a search as is desired. In addition, the classic heuristic we use to lay down our fist set of instructions may not work for all types of codes, since it can not be proven to work for any. So it is handy to hold onto a technique which uses a completely different approach to the problem. After having completed a survey of the normal switches to find the best, you might want to give combinatorial scheduling a try. If the kernel is not too large try -bs,5, -os,4 and -ls, 4. 2.3.3.3 Fast scheduling: -fs Fast scheduling (-fs) is accomplished by doing the default preliminary scheduling and then not following up with a linear smoothing pass. Sometimes the results of -fs will be better than the default, which should always be better and never worse. This is difficult to understand. When it happens we can only presume that the machine does not duplicate exactly the model we use to estimate the cost of stalled instructions. Fast scheduling is also faster because it limits several slow running heuristics in GVS to a single pass. 2.3.3.4 Full optimizations: -fo, -nd,-sa, -nsa -target 21064|21164 Full optimizations, (-fo), are part of what make scheduling a reality, even though they are not by definition a part of scheduling. As a practical problem, instruction scheduling falls into two phases: changing the core program so that you are more likely to find interesting code for which scheduling will have a big pay off and the formal algorithms used to find the best code once a reasonable block of instructions has been created. GVS employs a number of art substitutions and transformations which end up promoting the creation of schedulable code. Many of these transformations get turned on by -fo, and began life in PPS-860. One of the algorithms developed put the i860 into its dual instruction mode, in which it simultaneously issued integer and floating point instructions. Another, dualization, overlapped multiples and adds or adds and adds, in such a manner that pipelined operation could be achieved. These transformations carried over to the Alpha 21064, pretty much unchanged. Dualization on the 21064 attempts to create programs which can issue two instructions per clock just like the i860. On the 21164, dualization builds sequences which will issue four instructions per clock. You can turn off dualization using the -nd switch. GVS determines the number of instructions to issue using the -target switch. The primary goal of PPS-860 was to line up adds and multiplies so that the pipelined versions of these instructions could be used. Using primitive operations that ran up to a factor of 6 faster could speed up codes that were not considered to be vectorizable by a factor of 2. When fed a dot product, this technique resulted in better code, but code not nearly as good as the best hand coded dot products. In particular, the code produced by consolidating adds and multiplies within a block would always exit a block in such a manner that the block just performed the identical function as the original block. In the case of a dot product this resulted in code in which the variable used as an accumulator had to be brought up to date within the block. What we discovered about the 21064 is that the adder tree produced by GVS for a dot product did not add any overhead, partly because adds took 6 cycles. However on the 21164, the 4 cycle add now meant that we had to rethink this strategy. One of the things we did to improve dot products was introduce extra accumulators for loops that had been re-rolled (i.e., were not unrolled at the source level). To help this strategy along we summed these accumulators up outside of the inner loop that generated them The split accumulator heuristic gets turned on by -fo and looks for particular situations in which consolidation of structures like adder trees can be taken care of outside of the inner loop which generates them. What results is lower level code for dot products that resemble the best unrolled code produced by pre-processors like KAP from ordinary Fortran or C loops. When we introduced split accumulators we did not think that situations could arise in which this optimization could slow up kernels that did operations like dot products. However, like everything else with the 21164, we discovered situations where it sped things up and situations where it slowed things up. In particular, when we used it to compile a loop which had 8 source accumulators it worked fine until we unrolled the code by 2. At this point the code lost about 20% of its throughput. We decided that splitting off accumulators was more valuable than it was harmful and added a switch to turn it off (-nsa) in unusual situations where it had a deleterious effect. In addition, the switch -ea (exact arithmetic) turns off split accumulators, but we found there were situations where we wanted -ea and split accumulators. The switch -sa forces split accumulators even if exact arithmetic is requested. 2.3.4 Block size controls: -gs and -lb The treatment of code blocks can be crucial to scheduling. It is possible for compilers to create code blocks that contain 100 or more Alpha instructions, which is the value that the global block size has been set to to keep scheduling time down. In situations where block sizes larger than 100 instructions arise, -gs can be increased. The -lb switch gets used to speed up scheduling by artificially breaking code into smaller blocks each of which contains the assembly language that corresponds to a single source line. To use it you need to turn on line number information in the front end using the -X370 switch (see the -O switch in the compiler driver section). This switch was mainly invented to speed up GVS. Breaking code into smaller chunks speeds up scheduling and many other heuristics used by GVS. However, the concept of a loop broken into small pieces runs contrary to everything else we do with GVS. 2.3.5 Loop body expanders For a pipelined numeric unit to execute efficiently, it must keep its pipelines full for some period of time that is greater than the length of the pipeline. This rule probably dates back to Henry Ford who discovered that it paid to run an assembly line for more than 8 hours at a time. It turns out that an Alpha instruction is in many ways just like an Iceberg. In the case of the 21164, the part of the pipeline which shows above the water line is 4 cycles deep. However, there is another 7 cycles that is hidden and gets consumed by the processor to fetch instructions, decode instructions and figure out where arguments are coming from. The actual operation itself takes just 4 cycles and it is portion of the operation seen by the scheduler. Altogether, a typical Alpha floating point instruction will spend 11 cycles passing through the processors execution pipeline. What makes the Alpha so fast is that it can processes the bits and pieces of 40 instructions at the same time. However, for the pipeline to be efficient, it must remain full more than 50% of the time, which means it must execute at least twice the number of instructions contained by 11 cycles (44) before jumping to a location which forces it to flush its pipes. What this means is that loops which perform 20 or more cycles worth of instructions (i.e., perform as many as 80 operations) are required to produce really efficient code. It is not unusual for an efficient Alpha inner loop to contain 40 to 80 floating point instructions. These large expanses of inline instructions look to the CPU like a drag strip. To build a race course with long straight-aways we start off using the 25 year old IBM Fortran trick of unrolling loops. However, this is not the only process that grows the size of our code. For example, we can not always be sure when arrays overlap. If we want to be able to swap loads and stores in the scheduler (which turns out to be a crucial thing to do) we have to be sure that the addresses referred to by the load or store just swapped do not refer to the same address. If they do, we have just changed the meaning of the computation. This is difficult but not impossible to do, even for C. We handle the process using a technique that we call memory disambiguation, which instead of creating a single scheduled loop, builds several - each corresponding to a different array layout case. The determination of which loop to run gets made at runtime by code built into the program. Memory disambiguation adds time to the scheduling process, because it can easily turn one loop into five or six. Finally, we add vectorization. This takes an unrolled loop and examines all the different ways the loop can be folded back on itself. In the end, the version of the loop which results in the best code gets chosen, but not after many different cases get scheduled in search of the best loop organization. This process will double the amount of code required to write to execute the loop, even though the loop itself will not increase in size over that created for it by the loop unroller. All three of these techniques add code to the loop and can tremendously increase the size of what often starts out as a single line loop in a higher level program. Without these techniques, you can lose a factor of 10 or more in speed. Apply them to loops that are unimportant, and your program will quickly bury itself in red tape. Vector programs live or die by how well you manage the loop body expansion process. 2.3.5.1 Loop unrolling: -lu,# and -bf,# Loop unrolling (-lu #) is the starting point of loop body expansion - without it there is no point in scheduling critical vector routines, unless they have already been unrolled at the source level. The lu # switch tells GVS to unroll all loops # times. As we will see with dot products, source level unrolling can be used to change the structure of loops, which also affects performance. However, even loops which have been unrolled at the source level to control things like the number of registers being made available for the load chain, still require some amount of unrolling to be done by GVS. It is possible that large scalar loops that contain dozens of numeric instructions have enough meat on them to get by without unrolling, but even here loop unrolling can often pay off. As a rule, most compilers which support unrolling do a better job unrolling code themselves than they do generating code from sources that have been unrolled. This is true of the Alpha except where the unrolled source adds critical structural elements to the code, like accumulators. One of the big problems in unrolling comes when you attempt to unroll source that is already unrolled. In that situation, the code that results will often encounter resource restrictions like overloaded registers which choke performance. Over rolled code can also take forever to compile. To avoid this problem, and still make it possible to get reasonable results blindly unrolling source level primitives that are already unrolled, we invented the "be frugal" switch -bf #. This switch tells GVS to avoid "over-rolling." If a loop has more than # cycles, be frugal backs off on the unrolling until it has fewer than # cycles. When be frugal discovers an over-rolled loop and the verbose (-v) driver switch is used, GVS warns the user when it has to reduce the unrolling factor requested by lu #. Be frugal is actually a modifier of -lu #, which sets the initial unrolling factor to be used. Values of 8 and 9 for lu seem to do a good job for many 21164 codes that have not been unrolled. Smaller unrolling factors in the range of 4 to 5 should do better for codes that have been unrolled at the source level to a depth of 4. Read the vectorization section which discusses what we discovered unrolling and vectorizing a dot product kernel. There is a runtime cost to loop unrolling, and that is the cost of clean up code. Every loop that is unrolled requires that it get followed by a scalar loop which cleans up the extra iterations not performed in the unrolled loop. This clean up code usually is a straight scalar loop which runs much more slowly than the pipelined code found in the main loop itself. Since there can be a factor of 10 speed difference between the good and bad code, Amdahl's law comes into play, which states that in situations where code can be vectorized, the scalar portion of a program may start to become the bottleneck. In a situation where vectorized (i.e., pipelined) code is running a factor of 10 faster than scalar code, the time that it takes to execute a straight section that does 100 operations is equal to the time it takes to do 10 operations in a scalar clean up loop. If we were executing a loop which was unrolled by 10 and were asked to perform 109 iterations, the first 100 iterations of our effective loop would take the same time as the last 9. The bottom line here is that if the problem you are executing is always going to perform a constant small number of iterations, it will pay to make your unrolling factor a multiple of the number of iterations you need to perform. Put another way, the unrolling factor that works best for a particular benchmark may be a function of the size of the data sets it is processing. 2.3.5.2 Memory disambiguation: - ms #, -nm Memory disambiguation is a problem with most languages, but especially with C. In C it is impossible for the compiler to figure out if two arrays overlap, because array sizes and locations can be dynamic when arrays are referenced using pointers (which is the only way arrays are referenced in any language). The problem in C is often referred to as memory aliasing. If the compiler's scheduler can not determine that the elements of two arrays are stored at the same address, then it can not swap the stores and loads of these arrays. Freezing stores and loads effectively freezes the numeric operations attached to them, eliminating the possibility of getting the CPU to run in pipelined mode. The key to pipelining is often the ability to move some of the numeric operations of one iteration in an unrolled loop past those ahead of it. Memory disambiguation is a GVS heuristic initiated by the -ms # switch. The -nm switch turns it off. It uses set theory to create constraints which can be evaluated at runtime to determine if one or more arrays are overlapped. If there are a pair of arrays that can overlap # should be set to 1. If there are three (very rare) it should be set to 2. At runtime the code generated by this determines if an address conflict exists between a pair of references and if not, runs the code which follows it. This gets followed by another constraint condition and its code up to the point where it becomes impossible to swap loads and stores. The last section of code to be produced runs essentially scalar bound. Memory disambiguation consumes a tremendous amount of time when combined with optimizations like vectorization. However, it is not very costly if loops are not heavily unrolled and vectorization is not applied. However, without it , vector problems which do frequent loads and stores will remain scalar bound. In the case of the C version of LINPACK, memory disambiguation takes us from 40 megaflops (running on a 300 MHz 21164) up 104 megaflops. If we examine the three most well known LINPACK primitives, (which are referred to as BLA's or Basic Linear Algorithms) we can see that only DAXPY suffers from this problem: DAXPY y(i) = y(i) + da*x(i) DSCAL y(i) = da*y(i) DDOT DOT = SUM(x(i)*x(i)) For every pair of operations in DAXPY the computer must do two loads and one store. Without being able to swap loads and stores, there is no point in scheduling DAXPY, and without scheduling we won't be able to overlap adds and multiplies, which is a requirement if we want to start and add and a multiply on every cycle. However, in the case of dot products, where stores are done very rarely, disambiguation need not be applied. 2.3.5.3 Memory delay: -md # The Memory delay switch (-md ,#), increases the number of clocks that the GVS emulator uses to model register loads from memory (cache). By default the emulator assumes that all fetches are being made from the Dcache and take 3 cycles to complete. This code is often not right for code executing out of the Scache which has a 10 cycle latency. When you use the -tune ev5 switch on the DEC UNIX compilers, in addition to getting quad issue (which you get with GVS unless you use the -target -21064 switch) the other thing that DEC appears to do is add 7 cycles to each load. The fact that the code is sensitive to the assumed latency of loads also means that critical primitives might have to get analyzed based on the size of the data sets they are processing if the ultimate goal is to achieve the best performance. Carried to its logical conclusion, routines like the BLA's might have get broken into separate sections based on the size of the arrays being processed. In the case of the i860XP we discovered similar situations with respect to alignment of arrays in which it paid to independently handle the two possible alignment cases we saw. The Alpha actually suffers from the same problem or worse, but we have not had the time to address these sorts of second order issues, which would double the amount of scheduling we were doing. It takes more than scheduling to bridge the latency gap between the Bcache and the CPU. The only way you can bridge the large latencies associated with a load from the Bcache or memory is with a pipelined loader like that used by the i860, but none exists for the Alpha. The best you can do is use very large unrolling factors which end up creating load chains that act like conveyor belts and diminish the effective 30 cycle Bcache latency by the number of cycles that operands stay in the registers used in the chain. In some instances it might be possible for the distances between loads and their uses to move as many as 15 cycles apart. For the operand conveyor belts to function as planned the scheduler has be informed of the latency you want to associate with loads. There is another tool that can be used to reduce cache and memory latency that will become more important with the ev6, and that is the fact that it is possible to issue dummy loads which cause the CPU to prefetch memory into the Dcache. This feature will be more useful in the future because the number of outstanding fetches that will be allowed at one time will grow from 2 to 6. In our work with LINPACK we discovered that a combination of moderately large unrolling factors and a memory delay of 7 to 9 cycles ended up providing us with the best speeds. For the memory delay to have an effect, you need to be able to hold the operands in registers for a significant amount of time, which gets accomplished by setting lu to 8 or 9 in the case of LINPACK. Example - linpack.c One of the reasons that LINPACK is no longer as important as it once was is that it is the most memory intensive mark in the repertoire - a single line does a pair of floating point operations and 24 bytes of I/O. The algorithms used by LAPACK, can get the same work done with much less I/O effort. Knowing that LINPACK is very memory bound and uses arrays that are likely not to fit into cache tells us that to get the best results we are going to have use the memory delay switch (-md) along with loop unrolling (-lu), both of which are discussed in the next section. We also know that LINPACK does both loads and stores of y(i), which means one has to worry about overlaps between x(i) and y(i). Unless we use the -ms memory disambiguation switch the schduler is not going to be able to swap loads and stores - one of its key features. To find out which combination of loop unrolling and memory delay produces the best result we tried 100+ combinations of these two switches, varying the loop unrolling factor from 1 to 12 and using _____________________________________________________________________________________
mx linmain.s daxpyc.c -O3 -DDP -DROLL sec_micro.o -v -Wg,-fo,-ls,1,-lu,11,-ms,1,-md,9
The typical compilation line used to produce the results in Table 2 is shown above. Everything but daxpy has already been compiled to assembly form. The switches -DDP and -DROLL activate macros which turn on double precision and source which has not been unrolled: LINPACK always reports a pair of results. For example for md =1 and lu =1 LINPACK reports 38 and 38 megaflops for N = 100 and array dimension of 200 and 201. The peak throughput of 104/104 is obtained for md=8 and lu =9.
md x 1 4 5 6 7 8 9 10 11 12 time lu x seconds 1 38/38 39/39 39/39 .7 2 51/52 59/59 59/59 2 3 68/69 61/61 61/61 5 4 75/76 80/80 80/80 9 5 86/87 87/87 86/86 12 6 88/88 86/86 95/95 97/97 90/90 97/98 97/98 98/98 98/98 97.5 17 7 83/83 98/99 96/96 94/94 96/96 93/94 100/100 99/100 99/100 22 8 91/92 92/93 101/101 93/93 99/100 102/102 95/95 101/102 102/102 101/102 32 9 92/93 96/97 100/100 93/93 100/101 104/104 98/99 103/103 102/102 102/102 45 10 87/88 92/93 91/92 88/89 90/90 85/85 90/90 92/92 94/94 92/93 55
Table 2 LINPACK performance in megaflops vs md and lu @ 300 MHz _____________________________________________________________________________________ memory delays of 0 to 10 cycles. In the end we found two "sweet spots" which gave almost identical excellent results. We also varied other GVS switches and discovered that while we needed at least one level of memory disambiguation, vectorization did not pay. LINPACK performance for small arrays (100x100) scales as the speed of the Scache. It hits peak speeds for loops which are large enough to hide the 10 cycle latency of the Scache and which are scheduled so that loads preceed their usage by as many as 10 to 12 cycles. Table 2 shows the results of compiling linpack.c with NDP C|C++. The best results come for loops unrolled by 8 to 9 with memory delays of 7 to 10 cycles - these results are printed bold. The wrong settings on some of these switches can easily make a factor of 2 reduction in LINPACK throughput. The line used to compile linpack.c is at the top of the table. The time to schedule daxpyc.c in seconds is given in the right most column of Table 2. The scheduling time scales as the square of the loop unrolling factor. 2.3.5.4 Vectorization: -v This process is also called software pipelining and the GVS switch -v which gets used to turn it on. The GVS -v switch should not be confused with the -v verbose switch used on the driver command line to get it to emit extra information about the compilation process. Vectorization is always the last GVS optimization that you want to apply, because it can dramatically increase compilation times. For starters, vectorization appears to have its biggest effect when the kernels being optimized are operating at at least half of their full throughput, and that means out of the Dcache or Scache. Even then, it may not speed things up. When it does it can make close to a 100% increase in speed, but it can also increase compile times by a factor of 10. For the i860 we discovered that after we got the CPU running in pipelined mode, there were preferred places where we could place loops. For example, hand coding the complex version of DAXPY (ZAXPY) involved writing down the first 24 complex dot products while looking for a place to set up a loop. Complex operations do significantly more work than their real cousins, which means you have to work your way further down the chain of operations before you find a place to set up a loop. We chose to place our loop between iterations 10 and 18. The 8 complex instructions in the inner loop thus created had the nice property that the loads in the top half perfectly fed the bottom and vice versa. The same technique works for the Alpha, with the exception that the size of the Alpha's unrolled loops can be much larger than the ones we used on the i860. This results in the need for a formal method of finding where to fold a loop on itself in such a way that we maximize the efficiency of the register load store operations that turn out to be the defining characteristics of most RISC loops.
lu 1 lu 1,v delta v lu 2 lu 2,v delta v lu 3 lu 3,v delta v
benquik1.f 115 208 80% 150 205 36% 186 260 39%
benquik3.f 236 439 86% 323 351 09% 372 560 51%
benquik4.f 292 491 68% 384 683 77% 409 491 21%
benquik6.f 345 585 69% 472 512 09% 491 534 07%
benquik7.f 390 585 50% 512 512 00% 491 534 07%
benquik8.f 480 611 27% 579 727 25% 532 602 13%
benquik9.f 410 534 30% 558 560 0.3% 473 534 13%
benquik10.f 423 512 21% 472 535 13% 361 534 48%
benquik11.f 430 455 06% 491 491 0.0% 384 439 14%
benquik12.f 455 472 04% 455 472 04% 351 396 13%
Table 3 dot product throughput in megaflops vs. unrolling factor and vectorization @ 433 MHz The vectorization technique that we developed for the i860 to automatically fold an unrolled loop on itself was adopted to the Alpha. In the case of LINPACK, we saw no improvement, even though LINPACK does both loads and stores and you would think that a technique which improved things on the i860 would do the same on the Alpha. However, in the case of dot products we saw a improvements that ranged from 4 to 86%. The dot product results for a 433 MHz 21164 are listed in Table 3 along with the percentage increase due to vectorization: All these runs were made for data sets which fit exactly into the Alpha's 8 Kbyte Dcache. The first column was obtained unrolling all 10 versions of a program called benquik.f with an unrolling factor of 1. The second column used the same level of unrolling, but added vectorization. It is quite clear that the largest benefits arise in situations where the size of the loops is small. However there are other factors that also play a role. For example, lets examine two loops which are effectively unrolled by 8. The suffix on each Fortran source file describes the number of dot products formed in the inner most loop along with the number of accumulators used. The two versions of the compiled program in which we have a total of 8 dot products in the inner loop are benquik4.f unrolled by 2 (lu 2) and benquik8.f unrolled by 1 (lu 1). These two marks start out at 384 and 480 megaflops respectively. When we apply vectorization they jump to 683 and 611 megaflops. The increase in performance due to vectorization we has been placed in the column labeled with a delta v. The largest increase in performance is 86%. For the case of benquik4.f, the increase was 68%, but the low point started off at 291 megaflops. However, for the same effective amount of loop unrolling the vectorization of benquik8.f only improved 27%. What we can see looking at Table 3 is that in general the smaller the size of the loop, the more effective vectorization appears to be. However, when we compare benquik4.f unrolled by 2 with benquik8.f unrolled by 1, we see something else. Earlier we described what happened when we discussed the results in Table 3 with one of the gurus at DEC who wrote dot products for a living. He predicted that we should get our best results with code that used four accumulators, which turns out to be benquik4.f. In fact, he was correct, when we tried unrolling benquik4.f by 5 we got 834 megaflops for this mark. What we can conclude is that vectorization makes a larger improvement in performance when the potential for a loop to perform well is already an intrinsic feature of the code being vectorized. We now know that the code with 8 accumulators does not do as well when we use unrolling factors larger than 2, but that benquik4.f keeps increasing until you have unrolled its inner loop by 5! We also know that using too many accumulators detracts from dot product performance by removing registers that can be used to build a load chain. It is the load chain which hides the latency of the Scache and Bcache, thereby making it possible for the next heuristic, md, to play a role. We should point out that the routines in Table 3 and the cases plotted in Figure 1 were all compiled with the default memory delay of 3. When this code gets compiled with larger delays it behaves when run with larger data sets. The bottom line on vectorization appears to be it is a viable heuristic when the structure of the loop we are scheduling is a good fit with the hardware. In addition, in situations where it is effective, we get the largest percentage increase in performance with loops that have 2 to 8 elements in them.
We have not described how vectorization works or why it can add so much compile time. The heuristic used essentially re-writes a loop examining all the possible ways it is possible to fold the numeric operations of an unrolled inner loop on itself. For example, for a loop that does three operations, A, B and C, it is possible to create an inner loop yet to be scheduled which performs the operations in three different orders: (A, B, C), ( B, C, A) or (C, A, B). If we discover that the loop C, A, B schedules the best by running the scheduler over it and counting cycles, the inner loop in the program that gives the best performance would start with a pre-amble that did A and B, followed by the inner loop that did (C, A, B) which would get followed by a postamble which did C. Where a normal unrolled loop would execute n times, the inner loop here would be repeated n-1 times, as the preamble and postamble would count for one iteration through the inner loop. Clearly, when there are many operations in an inner loop the number of possible schedulings that must be examined to find the best one can be very large and very time consuming. 2.5 Selective optimization and scheduling The key to not wasting compilation time is not scheduling code which is rarely executed. This requires the user to know which routines or loops need heavy optimization and to direct the compiler to spend most of its time improving these sections only. The assumption here is that you as a user understand your code, and know which routines take most of the time. Once profiled with any profiler you can use this information in any future compilations. We will assume here that you don't have access to a formal profiler. If you do not, you need to figure out using another technique which routines have the heaviest work loads. For most algorithms it is possible to write down a simple equation which will compute the number of adds and multiplies carried to perform the algorithm. If your program is bound by another type of numeric operation, such as square roots, division or exponentiation, then all the scheduling and optimizations in the world are not going to buy much, as these operations either take dozens of cycles each and eliminate the possibility of getting overlapped or pipelined operation or are carried out by calls to library routines which are already optimized. We suggest above some ideas for inlining such routines, which often evaluate polynomials, provided that they have been broken up so that they are not scalar bound. The NDP Alpha family provides four techniques for singling out sections of code and getting the compiler to treat these sections specially. They include command line switches that provide control for the NDP front end and GVS, command files which get read by ABE and insert inline metacommands into its output for GVS, the insertion of GVS metacommands in Fortran or C source and the direct insertion of GVS metacommands into the assembly language files that are fed to GVS. This makes it possible to use a number of alternative compilation techniques. Users can experiment with the gross qualities of the NDP or GVS switches by changing them at the command line. When they discover which switches work best for which routines, they can make up a command file which describes how each of the critical routines in a program is to be optimized by GVS. This makes it possible to vary optimizations in a single module, without having to break it apart and do separate compilations. In some situations, the user may just want to optimize a single loop. This can be done at either the source level or by intercepting the output of ABE before optimizations have been done, and inserting GVS metacommands directly into the ABE assembly language. This last technique is rather tricky, and requires detailed knowledge of how the various parts of the compiler work with each other. However, the source level method can be easily used by any user. 2.5.1 Driver pass through control of GVS and the NDP front end The standard NDP technique for passing parameters through to compilation pre and post processors, is the "-Wx" switch, where the x can be one of a number of different letters depending on the environment and processor. There are two switches relevant to the NDP Alpha family, -Wc and -Wg. These control the compiler front end and GVS, respectively. The syntax used for these switches is a little unique and gets controlled by the compiler drivers mf and mx. It accepts commas as the standard delimiter. The first time a space appears, the string parser takes that as a sign that the control string has terminated, and it returns to parsing the rest of the command line. What this means is that the syntax used to control GVS, for example, is different on the GVS command line than it is in the string that follows -Wg. For example, if we wanted GVS to be invoked by the compiler driver with the switches -fo, -ls 4, -lu 8 and the NDP front end to be invoked with the switches -OLM -onrc, we could accomplish this task with the following line: mf test.f -O2 -Wg,-fo,-ls,4,-lu,8 The -O2 switch description in Table 1 states that it passes -OLM -onrc to the compiler front end and that it passes -fo -fs to GVS. While the compiler front switches agree with what we wanted passed to the front end, the -fo -fs do not. So, we add the -Wg pass through switch to the command line. The moment the compiler driver switch sees this command, it ignores the GVS values that are part of -O2 and replaces them with -fo, -ls 4 and -lu 8. Note, that the commas between the switch variables ls and its parameter (4) and lu and its parameter (8) were needed to avoid the driver from thinking that we had come to the end of the -Wg pass through command. In general, every switch passed using the Wg command must be delimited from its nearest neighbor with a comma, and the parameters of each switch setting that contains one, like ls or lu, must be separated from its parameter with a comma as well. A virtually identical command gets used to pass switches to the compiler front end. Suppose, for example, that we wanted to pass -OL and -X370 to the compiler, and -fo to GVS. Because of the fact that we probably still wanted to pass -O and -RA to ABE, we would use a -O switch to get the ABE switches correct. However, |