<?xml version="1.0" encoding="utf-8"?><vmp:back-matter id="back-matter" xmlns:vmp="http://www.hoise.com/vmp/manual/1.0"><vmp:glossary><vmp:glossary-entry>   <vmp:glossary-term>Architecture</vmp:glossary-term>   <vmp:glossary-definition>    The internal structure of a computer system or a chip that determines its operational functionality and performance.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Architectural class</vmp:glossary-term>   <vmp:glossary-definition>    Classification of computer systems according to their architecture: e.g., distributed memory MIMD computer, symmetric multiprocessor (SMP), etc. See this glossary and section <a href="architecture.html">architecture</a> for the description of the various classes.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>ASCI</vmp:glossary-term>   <vmp:glossary-definition>    Accelerated Strategic Computing Initiative. A massive funding project in the USA concerning research and production of high-performance systems. The main motivation is said to be the management of the USA nuclear stockpile by computational modeling instead of actual testing. ASCI has greatly influenced the development of high-performance systems in a single direction: clusters of SMP systems.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Bank cycle time</vmp:glossary-term>   <vmp:glossary-definition>    The time needed by a (cache-)memory bank to recover from a data access request to that bank. Within the bank cycle time no other requests to that bank can be accepted.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Beowulf cluster</vmp:glossary-term>   <vmp:glossary-definition>    Cluster of PCs or workstations with a private network to connect them. Initially the name was used for do-it-yourself collections of PCs, mostly connected by Ethernet and running Linux, as a cheap alternative to "integrated" parallel machines. Presently, the definition is wider and includes high-speed switched networks, fast RISC-based processors, and complete vendor-preconfigured rack-mounted systems with either Linux or Windows as an operating system.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Bit-serial</vmp:glossary-term>   <vmp:glossary-definition>    The operation on data on a bit-by-bit basis rather than on byte or 4/8-byte data entities in parallel. Bit-serial operation is done in processor array machines, where this mode is advantageous for signal and image processing.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Cache --- data, instruction</vmp:glossary-term>   <vmp:glossary-definition>    Small, fast memory close to the CPU that can hold a part of the data or instructions to be processed. The primary or level 1 caches are virtually always located on the same chip as the CPU and are divided into a cache for instructions and one for data. A secondary or level 2 cache is mostly located off-chip and holds both data and instructions. Caches are put into the system to hide the large latency that occurs when data have to be fetched from memory. By loading data and/or instructions that are likely to be needed into the caches, this latency can be significantly reduced.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Capability computing</vmp:glossary-term>   <vmp:glossary-definition>    A type of large-scale computing in which one wants to accommodate very large and time-consuming computing tasks. This requires that parallel machines or clusters are managed with the highest priority for this type of computing, possibly with the consequence that the computing resources in the system are not always used with the greatest efficiency.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Capacity computing</vmp:glossary-term>   <vmp:glossary-definition>    A type of large-scale computing in which one wants to use the system (cluster) with the highest possible throughput capacity, using the machine resources as efficiently as possible. This may have adverse effects on the performance of individual computing tasks while optimising the overall usage of the system.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>ccNUMA</vmp:glossary-term>   <vmp:glossary-definition>    Cache Coherent Non-Uniform Memory Access. Machines that support this type of memory access have a physically distributed memory that is logically shared. Because the data items reside in physically different locations, a data request may take a varying amount of time depending on the location of the data. As both the memory parts and the caches in such systems are distributed, a mechanism is necessary to keep the data consistent system-wide. There are various techniques to enforce this (directory memory, snoopy bus protocol). When one of these techniques is implemented the system is said to be cache coherent.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Clock cycle</vmp:glossary-term>   <vmp:glossary-definition>    Fundamental time unit of a computer. Every operation executed by the computer takes at least one and possibly multiple cycles. Typically, the clock cycle is now on the order of one to a few nanoseconds.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Clock frequency</vmp:glossary-term>   <vmp:glossary-definition>    Reciprocal of the clock cycle: the number of cycles per second expressed in Hertz (Hz). Typical clock frequencies nowadays are 400 MHz--1 GHz.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Control processor</vmp:glossary-term>   <vmp:glossary-definition>    The processor in a processor array machine that issues the instructions to be executed by all the processors in the processor array. Alternatively, the control processor may perform tasks in which the processors in the array are not involved, e.g., I/O operations or serial operations.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Crossbar (multistage)</vmp:glossary-term>   <vmp:glossary-definition>    A network in which all input ports are directly connected to all output ports without interference from messages from other ports. In a one-stage crossbar this has the effect that, for instance, all memory modules in a computer system are directly coupled to all CPUs. This is often the case in multi-CPU vector systems. In multistage crossbar networks the output ports of one crossbar module are coupled with the input ports of other crossbar modules. 
In this way one is able to buildnetworks that grow with logarithmic complexity, thus reducing thecost of a large network.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Distributed Memory (DM)</vmp:glossary-term>   <vmp:glossary-definition>    Architectural class of machinesin which the memory of the system is distributed over the nodes inthe system. Access to the data in the system has to be done via aninterconnection network that connects the nodes and may be eitherexplicit via message passing or implicit (either using HPF orautomatically in a ccNUMA system).  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Dual core chip</vmp:glossary-term>   <vmp:glossary-definition>    A chip that contains two CPUs and(possibly common) caches. Due to the progression of the integrationlevel more devices can be fitted on a chip. In fact, IBM makes adual core chip: the POWER4 and other vendors may follow in the nearfuture.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>EPIC</vmp:glossary-term>   <vmp:glossary-definition>    Explicitly Parallel Instruction Computing. Thisterm is coined by Intel for its IA-64 chips and the Instruction Setthat is defined for them. EPIC can be seen as Very LargeInstruction Word computing with a few enhancements. The gist of itis that no dynamic instruction scheduling is performed as is donein RISC processors but rather that instruction scheduling andspeculative execution of code is determined beforehand in thecompilation stage of a program. This simplifies the chip designwhile potentially many instructions can be executed inparallel.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Fat tree</vmp:glossary-term>   <vmp:glossary-definition>    A network that has the structure of a binary (quad) tree but that is modified such that near the root the available bandwidth is higher than near the leaves. This stems from the fact that often a root processor has to gather or broadcast data to all other processors, and without this modification contention would occur near the root.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>FPGA</vmp:glossary-term>   <vmp:glossary-definition>    FPGA stands for Field Programmable Gate Array. This is an array of logic gates that can be hardware-programmed to fulfill user-specified tasks. In this way one can devise special purpose functional units that may be very efficient for this limited task. As FPGAs can be reconfigured dynamically, be it only 100--1,000 times per second, it is theoretically possible to optimise them for more complex special tasks at speeds that are higher than what can be achieved with general purpose processors.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Functional unit</vmp:glossary-term>   <vmp:glossary-definition>    Unit in a CPU that is responsible for the execution of a predefined function, e.g., the loading of data in the primary cache or executing a floating-point addition.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Grid --- 2-D, 3-D</vmp:glossary-term>   <vmp:glossary-definition>    A network structure where the nodes are connected in a 2-D or 3-D grid layout. 
In virtually all casesthe end points of the grid are again connected to the startingpoints thus forming a 2-D or 3-D torus.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>HPF</vmp:glossary-term>   <vmp:glossary-definition>    High Performance Fortran. A compiler and run timesystem that enables to run Fortran programs on a distributed memorysystem as on a shared memory system. Data partition, processorslayout, etc. are specified as comment directives that makes itpossible to run the processor also serially. Present HPF availablecommercially allow only for simple partitioning schemes and allprocessors executing exactly the same code at the same time (ondifferent data, so-called Single Program Multiple Data (SPMD)mode).  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Hypercube</vmp:glossary-term>   <vmp:glossary-definition>    A network with logarithmic complexity whichhas the structure of a generalised cube: to obtain a hypercube ofthe next dimension one doubles the perimeter of the structure andconnect their vertices with the original structure.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Instruction Set Architecture</vmp:glossary-term>   <vmp:glossary-definition>    The set of instructionsthat a CPU is designed to execute. The Instruction Set Architecture(ISA) represents the repertoire of instructions that the designersdetermined to be adequate for a certain CPU. Note that CPUs ofdifferent making may have the same ISA. For instance the AMDprocessors (purposely) implement the Intel IA-32 ISA on a processorwith a different structure.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Memory bank</vmp:glossary-term>   <vmp:glossary-definition>    Part of (cache) memory that is addressed consecutively in the total set of memory banks, i.e., when data item <i>a(n)</i> is stored in bank <i>b</i>, data item <i>a(n+1)</i> is stored in bank <i>b+1</i>. (Cache) memory is divided into banks to evade the effects of the bank cycle time (see above). When data is stored or retrieved consecutively each bank has enough time to recover before the next request for that bank arrives.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Message passing</vmp:glossary-term>   <vmp:glossary-definition>    Style of parallel programming for distributed memory systems in which non-local data that is required must be transported explicitly to the processor(s) that need(s) it by appropriate send and receive messages.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>MPI</vmp:glossary-term>   <vmp:glossary-definition>    A message passing library, Message Passing Interface, that implements the message passing style of programming. Presently MPI is the <i>de facto</i> standard for this kind of programming.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>OpenMP</vmp:glossary-term>   <vmp:glossary-definition>    A shared memory parallel programming model in which shared memory systems and SMPs can be operated in parallel. The parallelisation is controlled by comment directives (in Fortran) or pragmas (in C and C++), so that the same programs can also be run unmodified on serial machines.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Pipelining</vmp:glossary-term>   <vmp:glossary-definition>    Segmenting a functional unit such that it can accept new operands every cycle while the total execution of the instruction may take many cycles. The pipeline construction works like a conveyor belt, accepting units until the pipeline is filled and then producing results every cycle.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Processor array</vmp:glossary-term>   <vmp:glossary-definition>    System in which an array (mostly a 2-D grid) of simple processors execute their program instructions in lock-step under the control of a Control Processor.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>PVM</vmp:glossary-term>   <vmp:glossary-definition>    Another message passing library that has been widely used. It was originally developed to run on collections of workstations and it can dynamically spawn or delete processes running a task. PVM has now largely been replaced by MPI.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Register file</vmp:glossary-term>   <vmp:glossary-definition>    The set of registers in a CPU that are independent targets for the code to be executed, possibly complemented with registers that hold constants like 0/1, registers for renaming intermediary results, and in some cases a separate register stack to hold function arguments and routine return addresses.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>RISC</vmp:glossary-term>   <vmp:glossary-definition>    Reduced Instruction Set Computer. A CPU whose instruction set is simpler than that of the earlier Complex Instruction Set Computers (CISCs). The instruction set was reduced to simple instructions that ideally should execute in one cycle.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Shared Memory (SM)</vmp:glossary-term>   <vmp:glossary-definition>    Memory configuration of a computer in which all processors have direct access to all the memory in the system. Because of technological limitations on shared bandwidth, generally not more than about 16 processors share a common memory.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>SMP</vmp:glossary-term>   <vmp:glossary-definition>    Symmetric Multi-Processing. This term is often used for compute nodes with shared memory that are part of a larger system and where this collection of nodes forms the total system. The nodes may be organised as a ccNUMA system or as a distributed memory system of which the nodes can be programmed using OpenMP while inter-node communication should be done by message passing.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>TLB</vmp:glossary-term>   <vmp:glossary-definition>    Translation Look-aside Buffer. A specialised cache that holds a table of physical addresses as generated from the virtual addresses used in the program code.  
</vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Torus</vmp:glossary-term>   <vmp:glossary-definition>    Structure that results when the end points of a grid are wrapped around to connect to the starting points of that grid. This configuration is often used in the interconnection networks of parallel machines, either with a 2-D grid or with a 3-D grid.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>Vector unit (pipe)</vmp:glossary-term>   <vmp:glossary-definition>    A pipelined functional unit that is fed with operands from a vector register and will produce a result every cycle (after filling the pipeline) for the complete contents of the vector register.  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry><vmp:glossary-entry>   <vmp:glossary-term>VLIW processing</vmp:glossary-term>   <vmp:glossary-definition>    Very Long Instruction Word processing. The use of long instruction words to keep many functional units busy in parallel. The scheduling of instructions is done statically by the compiler and, as such, requires high quality code generation by that compiler. VLIW processing has been revived in the IA-64 chip architecture, there called EPIC (see above).  </vmp:glossary-definition>   <vmp:glossary-reference>      </vmp:glossary-reference>  </vmp:glossary-entry></vmp:glossary> <vmp:acknowledgments><vmp:title>Acknowledgments</vmp:title><p>It is not possible to thank all the people who have contributed to this overview. Many vendors and people interested in this project have been so kind as to provide us with vital information or to correct us when necessary. 
Therefore, we willhave to thank them here collectively but not less heartily fortheir support.</p></vmp:acknowledgments><vmp:references id="references"><vmp:reference-entry id="Amza95">   <vmp:reference-author> C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R.   Rajamony, W. Yu, W. Zwaenepoel   </vmp:reference-author>   <vmp:reference-title>TreadMarks: Shared Memory   Computing on Networks of Workstations</vmp:reference-title>   <vmp:reference-publication>to appear in IEEE Computer   (also:       </vmp:reference-publication>       <vmp:reference-url>http://www.cs.rice.edu/~willy/TreadMarks/papers.htm</vmp:reference-url> </vmp:reference-entry><vmp:reference-entry id="ASCI"> <vmp:reference-author></vmp:reference-author>  <vmp:reference-publication>The ASCI program:</vmp:reference-publication> <vmp:reference-url>http://http://www.llnl.gov/asci/</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="CaSt98">   <vmp:reference-author>K. Cassirer, B. Steckel   </vmp:reference-author>   <vmp:reference-title>Block-Structured Multigrid   on the Cenju</vmp:reference-title>   <vmp:reference-publication>2<i><sup>nd</sup></i> Cenju Workshop, October 1998, Sankt   Augustin, Germany.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Cull98">   <vmp:reference-author>D.E. Culler, J.P. Singh, A. Gupta   </vmp:reference-author>   <vmp:reference-title>Parallel Computer  Architecture: A Hardware/Software Approach</vmp:reference-title>   <vmp:reference-publication>Morgan Kaufmann  Publishers Inc., August 1998.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Dong02">   <vmp:reference-author>J.J. Dongarra   </vmp:reference-author>   <vmp:reference-title>Performance of various computers   using standard linear equations software</vmp:reference-title>   <vmp:reference-publication>Computer Science Technical   Report CS-89-85, Univ. of Tennessee, July 17th,  2002.
</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="EurB99">   <vmp:reference-author></vmp:reference-author><vmp:reference-publication>Directory with EuroBen results:          </vmp:reference-publication>      <vmp:reference-url>http://www.euroben.nl/results</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="Flan91">   <vmp:reference-author>P. Flanders   </vmp:reference-author>   <vmp:reference-title>Matrix Multiplication on 'C' series DAPs   </vmp:reference-title>   <vmp:reference-publication>    AMT Document TR40, Jan. 1991.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Flynn72">   <vmp:reference-author>M.J. Flynn   </vmp:reference-author>   <vmp:reference-title>Some computer organisations and their   effectiveness</vmp:reference-title>   <vmp:reference-publication>IEEE Trans. on Computers, Vol. C-21, 9, (1972)   948--960.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Geist94">   <vmp:reference-author>A. Geist, A. Beguelin, J. Dongarra, R. Manchek,   W. Jiang, and V. Sunderam   </vmp:reference-author>   <vmp:reference-title>PVM: A Users' Guide and Tutorial for   Networked Parallel Computing</vmp:reference-title>   <vmp:reference-publication>MIT Press, Boston, 1994.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Gigan01"> <vmp:reference-url>http://www.giganet.com</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="Hill97">   <vmp:reference-author>J.M.D. Hill, W. McColl, D.C. Stefanescu, M.W. Goudreau,     K. Lang, S.B. Rao, T. Suel, T. Tsantilas, R. Bisseling</vmp:reference-author>     <vmp:reference-title>BSPlib: The BSP Programming Library</vmp:reference-title>   <vmp:reference-publication>Technical     report PRG-TR-29-9, Oxford University Computing Laboratory, May     1997. 
(Compressed Postscript with ANSI C examples, 142K;     Compressed Postscript with Fortran 77 examples, 141K)</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Hisr2201">    <vmp:reference-url>http://www.hitachi.co.jp/Prod/comp/hpc/eng/sr1.html</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="Hock88">   <vmp:reference-author>R. W. Hockney, C. R. Jesshope   </vmp:reference-author>   <vmp:reference-title>Parallel   Computers II</vmp:reference-title>   <vmp:reference-publication>Bristol: Adam Hilger, 1987.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Hori91">   <vmp:reference-author>T. Horie, H. Ishihata, T. Shimizu, S. Kato, S. Inano,   M. Ikesaka   </vmp:reference-author>   <vmp:reference-title>AP1000 architecture and performance of LU   decomposition</vmp:reference-title>   <vmp:reference-publication>Proc. Internat. Symp. on Supercomputing, Fukuoka, Nov.    1991, 46--55.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="HPF93">   <vmp:reference-author>High Performance Fortran Forum   </vmp:reference-author>   <vmp:reference-title>High Performance   Fortran Language Specification</vmp:reference-title>   <vmp:reference-publication>Scientific Programming, <b> 2</b>,   13, (1993) 1--170.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="JaLa90">   <vmp:reference-author>D.V. James, A.T. Laundrie, S. Gjessing, G.S. Sohi</vmp:reference-author>     <vmp:reference-title>Scalable Coherent Interface</vmp:reference-title>   <vmp:reference-publication>IEEE Computer, <b> 23</b>, 6,   (1990) 74--77. See also:    Scalable Coherent Interface:       </vmp:reference-publication>      <vmp:reference-url>http://sunrise.scu.edu/</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="MPI1">   <vmp:reference-author>M. Snir, S. Otto, S. Huss-Lederman, D. Walker,   J. 
Dongarra</vmp:reference-author>     <vmp:reference-title>MPI: The Complete Reference Vol. 1,   The MPI Core</vmp:reference-title>   <vmp:reference-publication>MIT Press, Boston, 1998.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="MPI2">   <vmp:reference-author>W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk,   B. Nitzberg, W. Saphir, M. Snir</vmp:reference-author>   <vmp:reference-title>MPI: The Complete Reference, Vol. 2,   The MPI Extensions</vmp:reference-title>   <vmp:reference-publication>MIT Press, Boston, 1998.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Myr00"> <vmp:reference-url>http://www.myrinet.com</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="Nagel98">   <vmp:reference-author>W.E. Nagel   </vmp:reference-author>   <vmp:reference-title> Applications on the Cenju: First   Experience with Effective Performance</vmp:reference-title>   <vmp:reference-publication>2<i><sup>nd</sup></i> Cenju Workshop,   October 1998, Sankt Augustin, Germany.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="npb2">   <vmp:reference-author>Web page for the NAS Parallel benchmarks NPB2:         </vmp:reference-author>      <vmp:reference-url>http://science.nas.nasa.gov/Software/NPB/</vmp:reference-url></vmp:reference-entry><vmp:reference-entry id="opmp97">   <vmp:reference-author>OpenMP Forum   </vmp:reference-author>   <vmp:reference-title>Fortran Language Specification, version   1.0</vmp:reference-title>   <vmp:reference-publication>Web page:       </vmp:reference-publication>      <vmp:reference-url>http://www.openmp.org/</vmp:reference-url>  October 1997.</vmp:reference-entry><vmp:reference-entry id="SPECT00">   <vmp:reference-author>D.H.M. 
Spector</vmp:reference-author>   <vmp:reference-title>Building Unix Clusters</vmp:reference-title>   <vmp:reference-publication>O'Reilly, Sebastopol, CA, USA, July 2000.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Steen90">   <vmp:reference-author>A.J. van der Steen</vmp:reference-author>    <vmp:reference-title>Exploring VLIW: Benchmark   tests on a Multiflow TRACE 14/300</vmp:reference-title>   <vmp:reference-publication>Academic Computing Centre Utrecht,   Technical Report TR-31, April 1990.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Steen91">   <vmp:reference-author>A.J. van der Steen   </vmp:reference-author>   <vmp:reference-title>The benchmark of the EuroBen   Group</vmp:reference-title>   <vmp:reference-publication>Parallel Computing <b> 17</b> (1991) 1211--1221.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Steen93">   <vmp:reference-author>A.J. van der Steen   </vmp:reference-author>   <vmp:reference-title>Benchmark results for the Hitachi    S-3800</vmp:reference-title>   <vmp:reference-publication>Supercomputer, <b> 10</b>, 4/5, (1993) 32--45.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Steen95">   <vmp:reference-author>A.J. van der Steen, ed. </vmp:reference-author>   <vmp:reference-title>Aspects of computational   science</vmp:reference-title>   <vmp:reference-publication>NCF, The Hague, 1995.</vmp:reference-publication></vmp:reference-entry><vmp:reference-entry id="Steen98">   <vmp:reference-author>A.J. van der Steen   </vmp:reference-author>   <vmp:reference-title>Benchmarking the Silicon   Graphics Origin2000 System</vmp:reference-title>   <vmp:reference-publication>Technical Report WFI-98-2, Dept. of   Computational Physics, Utrecht University, The Netherlands, May 1998.   
The report can be downloaded from:       </vmp:reference-publication>      <vmp:reference-url>http://www.euroben.nl/reports/</vmp:reference-url> </vmp:reference-entry><vmp:reference-entry id="Steen00">   <vmp:reference-author>A.J. van der Steen</vmp:reference-author>   <vmp:reference-title>An evaluation of some Beowulf clusters</vmp:reference-title>   <vmp:reference-publication>     Technical Report WFI-00-07, Utrecht University, Dept. of Computational Physics, December 2000. Also available through:     </vmp:reference-publication>      <vmp:reference-url>http://www.euroben.nl</vmp:reference-url> </vmp:reference-entry><vmp:reference-entry id="Ster99">   <vmp:reference-author>T.L. Sterling, J. Salmon, D.J. Becker, D.F. Savarese</vmp:reference-author>   <vmp:reference-title>How to Build a Beowulf</vmp:reference-title>   <vmp:reference-publication>The MIT Press, Boston, 1999.   </vmp:reference-publication>      <vmp:reference-url></vmp:reference-url> </vmp:reference-entry><vmp:reference-entry id="Top500">   <vmp:reference-author>H.W. Meuer, E. Strohmaier, J.J. Dongarra, H.D. Simon</vmp:reference-author>   <vmp:reference-title>Top500 Supercomputer Sites</vmp:reference-title>   <vmp:reference-publication>   18th Edition, June 20, 2002. The report can be downloaded from:   </vmp:reference-publication>      <vmp:reference-url>http://www.netlib.org/benchmark/top500.html</vmp:reference-url> </vmp:reference-entry><vmp:reference-entry id="TFCC">   <vmp:reference-author>Mark Baker (ed.)</vmp:reference-author>   <vmp:reference-title>Cluster Computing White Paper</vmp:reference-title>   <vmp:reference-publication>December 2000, to be downloaded from:    </vmp:reference-publication>      <vmp:reference-url>http://www.dcs.port.ac.uk/~mab/tfcc/WhitePaper</vmp:reference-url> </vmp:reference-entry></vmp:references><vmp:list-of-figures/></vmp:back-matter>