<?xml version="1.0" encoding="utf-8"?><vmp:chapter id="chapter-2" xmlns:vmp="http://www.hoise.com/vmp/manual/1.0"><vmp:title>The main architectural classes</vmp:title><vmp:section id="arch-class-intro">  <vmp:title>The main architectural classes</vmp:title><p>Before going on to the descriptions of the machines themselves, it is important to consider some mechanisms that are or have been used to increase performance. The hardware structure or <i>architecture</i> determines to a large extent what the possibilities and impossibilities are in speeding up a computer system beyond the performance of a single CPU. Another important factor, considered in combination with the hardware, is the capability of compilers to generate efficient code to be executed on the given hardware platform. In many cases it is hard to distinguish between hardware and software influences, and one has to be careful in interpreting results when ascribing certain effects to hardware or software peculiarities or both. In this chapter we will give most emphasis to the hardware architecture. For a description of machines that can be classified as "high-performance", the reader is referred to <a href="references.html#Cull98">[5]</a> and <a href="references.html#Steen95">[28]</a>.</p><p>For many years the taxonomy of Flynn <a href="references.html#Flynn72">[9]</a> has proven useful for the classification of high-performance computers. This classification is based on the way instruction and data streams are manipulated and comprises four main architectural classes. We will first briefly sketch these classes and afterwards fill in some details when each of the classes is described separately.</p><ul><li><b>SISD</b> machines: These are the conventional systems that contain one CPU and hence can accommodate one instruction stream that is executed serially. 
Nowadays many large mainframes may have more than one CPU but each of these executes instruction streams that are unrelated. Therefore, such systems should still be regarded as (a couple of) SISD machines acting on different data spaces. Examples of SISD machines are most workstations, like those of DEC, Hewlett-Packard, and Sun Microsystems. The definition of SISD machines is given here for completeness' sake. We will not discuss this type of machine in this report.</li><li><b>SIMD</b> machines: Such systems often have a large number of processing units, ranging from 1,024 to 16,384, that all may execute the same instruction on different data in lock-step. So, a single instruction manipulates many data items in parallel. Examples of SIMD machines in this class are the CPP DAP Gamma II and the Quadrics Apemille.</li><li>Another subclass of the SIMD systems are the vectorprocessors. Vectorprocessors act on arrays of similar data rather than on single data items, using specially structured CPUs. When data can be manipulated by these vector units, results can be delivered at a rate of one, two and, in special cases, three per clock cycle (a clock cycle being defined as the basic internal unit of time for the system). So, vector processors operate on their data in an almost parallel way, but only when executing in vector mode. In this case they are several times faster than when executing in conventional scalar mode. For practical purposes vectorprocessors are therefore mostly regarded as SIMD machines. An example of such a system is the NEC SX-6i.</li><li><b>MISD</b> machines: Theoretically, in this type of machine multiple instructions should act on a single stream of data. As yet no practical machine in this class has been constructed, nor are such systems easy to conceive. 
We will disregard them in the following discussions.</li><li><b>MIMD</b> machines: These machines execute several instruction streams in parallel on different data. The difference with the multi-processor SISD machines mentioned above lies in the fact that the instructions and data are related because they represent different parts of the same task to be executed. So, MIMD systems may run many sub-tasks in parallel in order to shorten the time-to-solution for the main task to be executed. There is a large variety of MIMD systems, and especially in this class the Flynn taxonomy proves to be not fully adequate for the classification of systems. Systems that behave very differently, like a four-processor NEC SX-6 and a thousand-processor SGI/Cray T3E, both fall in this class. In the following we will make another important distinction between classes of systems and treat them accordingly. <ul><li><b>Shared memory systems</b>: Shared memory systems have multiple CPUs, all of which share the same address space. This means that the knowledge of where data is stored is of no concern to the user, as there is only one memory accessed by all CPUs on an equal basis. Shared memory systems can be either SIMD or MIMD. Single-CPU vector processors can be regarded as an example of the former, while the multi-CPU models of these machines are examples of the latter. We will sometimes use the abbreviations SM-SIMD and SM-MIMD for the two subclasses.</li><li><b>Distributed memory systems</b>: In this case each CPU has its own associated memory. The CPUs are connected by some network and may exchange data between their respective memories when required. In contrast to shared memory machines, the user must be aware of the location of the data in the local memories and will have to move or distribute these data explicitly when needed. Again, distributed memory systems may be either SIMD or MIMD. 
Thefirst class of SIMD systems mentioned which operate in lock step,all have distributed memories associated to the processors. As wewill see, distributed-memory MIMD systems exhibit a large varietyin the topology of their connecting network. The details of thistopology are largely hidden from the user which is quite helpfulwith respect to portability of applications. For thedistributed-memory systems we will sometimes use DM-SIMD andDM-MIMD to indicate the two subclasses.</li></ul></li></ul><p>As already alluded to, although the difference between shared-and distributed memory machines seems clear cut, this is not alwaysentirely the case from user's point of view. For instance, the lateKendall Square Research systems employed the idea of "virtualshared memory" on a hardware level. Virtual shared memory can alsobe simulated at the programming level: A specification of HighPerformance Fortran (HPF) was published in 1993 <ahref="references.html#HPF93">[16]</a> which by means of compilerdirectives distributes the data over the available processors.Therefore, the system on which HPF is implemented in this case willlook like a shared memory machine to the user. Other vendors ofMassively Parallel Processing systems (sometimes called MPPsystems), like HP and SGI, also are able to support proprietaryvirtual shared-memory programming models due to the fact that thesephysically distributed memory systems are able to address the wholecollective address space. So, for the user such systems have one<em>global address space</em> spanning all of the memory in thesystem. We will say a little more about the structure of suchsystems in the <a href="ccNUMA.html#ccNUMA">ccNUMA</a> section. 
Inaddition, packages like TreadMarks (<ahref="references.html#Amza95">[1]</a>) provide a virtual sharedmemory environment for networks of workstations.</p><p><i>Distributed processing</i> takes the DM-MIMD concept one stepfurther: instead of many integrated processors in one or severalboxes, workstations, mainframes, etc., are connected by (Gigabit)Ethernet, FDDI, or otherwise and set to work concurrently on tasksin the same program. Conceptually, this is not different fromDM-MIMD computing, but the communication between processors isoften orders of magnitude slower. Many packages to realisedistributed computing are available. Examples of these are PVM(standing for <b>P</b>arallel <b>V</b>irtual <b>M</b>achine) <ahref="references.html#Geist94">[10]</a>, and MPI (<b>M</b>essage<b>P</b>assing <b>I</b>nterface, <ahref="references.html#MPI1">[18]</a>), <ahref="references.html#MPI2">[19]</a>). This style of programming,called the "message passing" model has becomes so much acceptedthat PVM and MPI have been adopted by virtually all major vendorsof distributed-memory MIMD systems and even on shared-memory MIMDsystems for compatibility reasons. In addition there is a tendencyto cluster shared-memory systems, for instance by HiPPI channels,to obtain systems with a very high computational power. E.g., theNEC SX-6, and the Cray SV1ex have this structure. So, within theclustered nodes a shared-memory programming style can be used whilebetween clusters message-passing should be used.</p><p>For SM-MIMD systems we should mention OpenMP <ahref="references.html#Chandra01">[4]</a>, that can be used toparallelise Fortran and C(++) programs by inserting commentdirectives (Fortran 77/90/95) or pragmas (C/C++) into the code.
OpenMP has quickly been adopted by the major vendors and has become a well established standard for shared memory systems.</p></vmp:section><vmp:section id="sm-simd"><vmp:title>Shared-memory SIMD machines</vmp:title><p>This subclass of machines is practically equivalent to the single-processor vectorprocessors, although other interesting machines in this subclass have existed (viz. VLIW machines <a href="references.html#Steen90">[25]</a>). In the block diagram in <vmp:linkto figure="vecpr.jpg" /> we depict a generic model of a vector architecture. <a id="figvecpr" name="figvecpr"></a></p><vmp:figure id="vecpr.jpg" description="Block diagram of a vector processor" /><p>The single-processor vector machine will have only one of the vectorprocessors depicted, and the system may even have its scalar floating-point capability shared with the vector processor (as was the case in some <a href="gone.html">Cray systems</a>). It may be noted that the VPU does not show a cache. The majority of vectorprocessors do not employ a cache anymore. In many cases the vector unit cannot take advantage of it, and execution speed may even be unfavourably affected because of frequent cache overflow.</p><p>Although vectorprocessors have existed that loaded their operands directly from memory and stored the results again immediately in memory (CDC Cyber 205, ETA-10), all present-day vectorprocessors use vector registers. This usually does not impair the speed of operations while providing much more flexibility in gathering operands and manipulating intermediate results.</p><p>Because of the generic nature of Figure <a href="#figvecpr">1</a> no details of the interconnection between the VPU and the memory are shown. 
Still, these details are very important for the effective speed of a vector operation: when the bandwidth between memory and the VPU is too small it is not possible to take full advantage of the VPU because it has to wait for operands and/or has to wait before it can store results. When the ratio of arithmetic to load/store operations is not high enough to compensate for such situations, severe performance losses may be incurred.</p><p>The influence of the number of load/store paths for the dyadic vector operation <i>c = a</i> + <i>b</i> (<i>a</i>, <i>b</i>, and <i>c</i> vectors) is depicted in <vmp:linkto figure="loads.jpg" />.</p><vmp:figure id="loads.jpg" description="Schematic diagram of a vector addition. Case (a) when two load- and one store pipe are available; case (b) when two load/store pipes are available." /><p>Because of the high costs of implementing these datapaths between memory and the VPU, compromises are often sought and the number of systems that have the full required bandwidth (i.e., two load operations and one store operation at the <i>same</i> time) is limited. In fact, in the vector systems marketed today this high bandwidth does not occur any longer. Vendors rather rely on additional caches and other tricks to hide the lack of bandwidth.</p><p>The VPUs are shown as a single block in Figure <a href="#figvecpr">1</a>. Yet, again there is a considerable diversity in the structure of VPUs. Every VPU consists of a number of vector functional units, or "pipes", that fulfill one or several functions in the VPU. Every VPU will have pipes that are designated to perform memory access functions, thus assuring the timely delivery of operands to the arithmetic pipes and of storing the results in memory again. Usually there will be several arithmetic functional units for integer/logical arithmetic, for floating-point addition, for multiplication and sometimes a combination of both, a so-called compound operation. 
Division is performed by an iterative procedure, table look-up, or a combination of both using the add and multiply pipe. In addition, there will almost always be a mask pipe to enable operation on a selected subset of elements in a vector of operands. Lastly, such sets of vector pipes can be replicated within one VPU (2- up to 16-fold replication occurs). Ideally, this will increase the performance per VPU by the same factor, provided the bandwidth to memory is adequate.</p></vmp:section><vmp:section id="dm-mimd"><vmp:title>Distributed-memory MIMD machines</vmp:title><p>The class of DM-MIMD machines is undoubtedly the fastest growing part in the family of high-performance computers, although this type of machine is more difficult to deal with than shared-memory machines and DM-SIMD machines. The latter type of machines are processor-array systems in which the data structures that are candidates for parallelisation are vectors and multi-dimensional arrays that are laid out automatically on the processor array by the system software. For shared-memory systems the data distribution is completely transparent to the user. This is quite different for DM-MIMD systems, where the user has to distribute the data over the processors and also the data exchange between processors has to be performed explicitly. The initial reluctance to use DM-MIMD machines seems to have decreased. Partly this is due to the now existing standard for communication software (<a href="references.html#Geist94">[10,18,19]</a>) and partly because, at least theoretically, this class of systems is able to outperform all other types of machines.</p><p>The advantages of DM-MIMD systems are clear: the bandwidth problem that haunts shared-memory systems is avoided because the bandwidth scales up automatically with the number of processors.
Furthermore, the speed of the memory, which is another critical issue with shared-memory systems (to get a peak performance that is comparable to that of DM-MIMD systems, the processors of the shared-memory machines should be very fast and the speed of the memory should match it), is less important for the DM-MIMD machines, because more processors can be configured without the aforementioned bandwidth problems.</p><p>Of course, DM-MIMD systems also have their disadvantages: the communication between processors is much slower than in SM-MIMD systems, and so the synchronisation overhead in case of communicating tasks is generally orders of magnitude higher than in shared-memory machines. Moreover, data that are not in the local memory belonging to a particular processor have to be obtained from non-local memory (or memories). This is again on most systems very slow as compared to local data access. When the structure of a problem dictates a frequent exchange of data between processors and/or requires many processor synchronisations, it may well be that only a very small fraction of the theoretical peak speed can be obtained. As already mentioned, the data and task decomposition are factors that mostly have to be dealt with explicitly, which may be far from trivial.</p><p>It will be clear from the paragraph above that also for DM-MIMD machines both the topology and the speed of the datapaths are of crucial importance for the practical usefulness of a system. Again, as in the section on <a href="sm-mimd.html">SM-MIMD systems</a>, the richness of the connection structure has to be balanced against the costs. Of the many conceivable interconnection structures only a few are popular in practice. 
One of these is the so-called hypercube topology as depicted in <vmp:linkto figure="netw2.jpg" />.</p><vmp:figure id="netw2.jpg" description="Some often used networks for DM machine types" /><p>A nice feature of the hypercube topology is that for a hypercube with 2<i><sup>d</sup></i> nodes the number of steps to be taken between any two nodes is at most <i>d</i>. So, the dimension of the network grows only logarithmically with the number of nodes. In addition, theoretically, it is possible to simulate any other topology on a hypercube: trees, rings, 2-D and 3-D meshes, etc. In practice, the exact topology for hypercubes does not matter too much anymore because all systems in the market today employ what is called "wormhole routing". This means that when a message is sent from node <i>i</i> to node <i>j</i>, a header message is first sent from <i>i</i> to <i>j</i>, resulting in a direct connection between these nodes. As soon as this connection is established, the data proper is sent through this connection without disturbing the operation of the intermediate nodes. Except for a small amount of time in setting up the connection between nodes, the communication time has become virtually independent of the distance between the nodes. Of course, when several messages in a busy network have to compete for the same paths, waiting times are incurred as in any network that does not directly connect every processor to all others, and often rerouting strategies are employed to circumvent busy links.</p><p>Another cost-effective way to connect a large number of processors is by means of a <i>fat tree</i>. In principle a simple tree structure for a network is sufficient to connect all nodes in a computer system. However, in practice it turns out that near the root of the tree congestion occurs because of the concentration of messages that first have to traverse the higher levels in the tree structure before they can descend again to their target nodes.
The fat tree amends this shortcoming by providing more bandwidth (mostly in the form of multiple connections) in the higher levels of the tree. An example of a fat tree with a bandwidth in the highest level that is doubled with respect to the lower levels is shown in Figure <a href="#netw2">5 (b)</a>.</p><p>A number of massively parallel DM-MIMD systems seem to favour a 2-D or 3-D mesh (torus) structure. The rationale for this seems to be that most large-scale physical simulations can be mapped efficiently on this topology and that a richer interconnection structure hardly pays off. However, some systems maintain (an) additional network(s) besides the mesh to handle certain bottlenecks in data distribution and retrieval <a href="references.html#Hori91">[15]</a>.</p><p>A large fraction of systems in the DM-MIMD class employ crossbars. For relatively small numbers of processors (in the order of 64) this may be a direct or 1-stage crossbar, while to connect larger numbers of nodes multi-stage crossbars are used, i.e., the connections of a crossbar at level 1 connect to a crossbar at level 2, etc., instead of directly to nodes at more remote distances in the topology. In this way it is possible to connect in the order of a few thousands of nodes through only a few switching stages. In addition to the hypercube structure, other logarithmic complexity networks like Butterfly-, &Omega;-, or shuffle-exchange networks are often employed in such systems.</p><p>As with SM-MIMD machines, a node may in principle consist of any type of processor (scalar or vector) for computation or transaction processing together with local memory (with or without cache) and, in almost all cases, a separate communication processor with links to connect the node to its neighbours. Nowadays, the node processors are mostly off-the-shelf RISC processors, sometimes enhanced by vector processors. A problem that is peculiar to these DM-MIMD systems is the mismatch of communication vs. 
computationspeed that may occur when the node processors are upgraded, withoutalso speeding up the intercommunication. In some cases this mayresult in turning computational-bound problems intocommunication-bound problems.</p></vmp:section><vmp:section id="ccNUMA"><vmp:title>ccNUMA machines</vmp:title><p>As already mentioned in the introduction, a trend can beobserved to build systems that have a rather small (up to 16)number of RISC processors that are tightly integrated in a cluster,a Symmetric Multi-Processing (SMP) node. The processors in such anode are virtually always connected by a 1-stage crossbar whilethese clusters are connected by a less costly network. Such asystem may look as depicted in <vmp:linkto figure="hm-block.jpg" />.Note that in Figure 6 all CPUs in a cluster are connected to acommon part of the memory.</p><vmp:figure id="hm-block.jpg" description="Block diagram of a system with a'hybrid' network: clusters of four CPUs are connected by acrossbar. The clusters are connected by a less expensive network,e.g., a Butterfly network." /><p>This is similar to the policy mentioned for largevectorprocessor ensembles mentioned above but with the importantdifference that all of the processors can access all of the addressspace if necessary. The most important ways to let the SMP nodesshare their memory are S-COMA <strong>S</strong>imple<strong>C</strong>ache-<strong>O</strong>nly<strong>M</strong>emory <strong>A</strong>rchitecture) and ccNUMA,which stands for <strong>C</strong>ache <strong>C</strong>oherent<strong>N</strong>on-<strong>U</strong>niform<strong>M</strong>emory <strong>A</strong>ccess. Therefore, suchsystems can be considered as SM-MIMD machines. On the other hand,because the memory is physically distributed, it cannot beguaranteed that a data access operation always will be satisfiedwithin the same time. 
In S-COMA systems the cache hierarchy of the local nodes is extended to the memory of the other nodes. So, when data is required that does not reside in the local node's memory it is retrieved from the memory of the node where it is stored. In ccNUMA this concept is further extended in that all memory in the system is regarded (and addressed) globally. So, a data item may not be physically local but logically it belongs to one shared address space. Because the data can be physically dispersed over many nodes, the access time for different data items may well be different, which explains the term non-uniform data access. The term "Cache Coherent" refers to the fact that for all CPUs any variable that is to be used must have a consistent value. Therefore, it must be assured that the caches that provide these variables are also consistent in this respect. There are various ways to ensure that the caches of the CPUs are coherent. One is the <em>snoopy bus protocol</em> in which the caches listen in on transport of variables to any of the CPUs and update their own copies of these variables if they have them. Another way is the <em>directory memory</em>, a special part of memory that makes it possible to keep track of all the copies of variables and of their validity.<br />Presently, no commercially available machine uses the S-COMA scheme. By contrast, there are several popular ccNUMA systems (HP SuperDome, SGI Origin3000) commercially available.</p><p>For all practical purposes we can classify these systems as being SM-MIMD machines, also because special assisting hardware/software (such as a directory memory) has been incorporated to establish a single system image although the memory is physically distributed.</p></vmp:section><vmp:section  id="clusters" ><vmp:title>Clusters</vmp:title><p>The adoption of clusters, collections of workstations/PCs connected by a local network, has virtually exploded since the introduction of the first Beowulf cluster in 1994. 
The attractionlies in the (potentially) low cost of both hardware and softwareand the control that builders/users have over their system. Theinterest for clusters can be seen for instance from the active IEEETask Force on Cluster Computing (TFCC) which regularly issues aWhite Paper in which the current status of cluster computing isreviewed <a href="references.html#TFCC">[33]</a>. Also books how tobuild and maintain clusters have greatly added to their popularity(see, e.g.,<ahref="references.html#Ster99">[31]</a> and .As the cluster scene becomes relatively mature and an attractivemarket, large HPC vendors as well as many start-up companies haveentered the field and offer more or less ready out-of-the-boxcluster solutions for those groups that do not want to build theircluster from scratch.</p><p>The number of vendors that sell cluster configurations hasbecome so large that it is not sensible to include all theseproducts in this report. In addition, there is generally a largedifference in the usage of clusters and their more integratedcounterparts that we discuss in the following sections: clustersare mostly used for <em>capability computing</em> while theintegrated machines primarily are used for <em>capacitycomputing</em>. The first mode of usage meaning that the system isemployed for one or a few programs for which no alternative isreadily available in terms of computational capabilities. Thesecond way of operating a system is in employing it to the full byusing the most of its available cycles by many, often verydemanding, applications and users. Traditionally, vendors of largesupercomputer systems have learned to provide for this last mode ofoperation as the precious resources of their systems were requiredto be used as effectively as possible. 
By contrast, Beowulfclusters are mostly operated through the Linux operating system (asmall minority using Microsoft Windows) where these operatingsystems either miss the tools or these tools are relativelyimmature to use a cluster well for capacity computing. However, asclusters become on average both larger and more stable, there is atrend to use them also as computational capacity servers. In <ahref="references.html#Steen00">[30]</a> is looked at some of theaspects that are necessary conditions for this kind of use likeavailable cluster management tools and batch systems. In the samestudy also the performance on an application workload was assessed,both on a RISC (Compaq Alpha) based configuration and on IntelPentium III based systems. An important, but not very surprisingconclusion was that the speed of the network is very important inall but the most compute bound applications. Another notableobservation was that using compute nodes with more than 1 CPU maybe attractive from the point of view of compactness and (possibly)energy and cooling aspects, but that the performance can beseverely damaged by the fact that more CPUs have to draw on acommon node memory. The bandwidth of the nodes is in this case notup to the demands of memory intensive applications.</p><p>Fortunately, there is nowadays a fair choice of communicationnetworks available in clusters. Of course 100 Mb/s Ethernet isalways possible, which is attractive for economic reasons, but hasthe drawbacks of a very modest maximum bandwidth (about 10 MB/s)and a high latency (about 100 &micro;s). Gigabit Ethernet has amaximum bandwidth that is 10 times higher but has about the samelatency. Alternatively, there are for instance networks thatoperate from user space, like Myrinet <ahref="references.html#Myr00">[20]</a>, Giganet cLAN <ahref="references.html#Gigan01">[11]</a>, and SCI <ahref="references.html#JaLa90">[17]</a>. 
The first two have maximumbandwidths in the order of 100 MB/s and a latency in the range of15--20 &micro;s. SCI has a higher bandwidth (400--500 MB/stheoretically) and a latency under 10 &micro;s. The latter solutionis more costly but is nevertheless employed in some clusterconfigurations. The network speeds as shown by Myrinet, cLAN, and,certainly, SCI is more or less on par with some integrated parallelsystems as discussed later. So, possibly apart from the speed ofthe processors and of the software that is provided by the vendorsof DM-MIMD supercomputers, the distinction between clusters andthis class of machines becomes rather small and will undoubtlydecrease in the coming years.</p><p>The best starting point for the state-of-the-art in clustercomputing is given in the TFCC White Paper <ahref="references.html#TFCC">[33]</a> already mentioned. It gives anpointers to available products, both hardware and software, openquestions and the focus of the present research regarding thesequestions.</p></vmp:section><vmp:section id="processors"><vmp:title>Processors</vmp:title><p>In comparison to 10 years ago the processor scene has becomedrastically different. While in the period 1980--1990, theproprietary processors and in particular the vectorprocessors werethe driving forces of the supercomputers of that period, today thatrole has been taken on by common off-the-shelf RISC processors. Infact there are only three companies left that produce vectorsystems while all other systems that are offered are based on RISCCPUs (except the Cray MTA-2). We think, therefore, that it isuseful to give a brief description of the main processors thatpopulate the present supercomputers and look a little ahead to theprocessors that will follow in the coming year.</p><p>The modern RISC processors generally have a clock frequency thatis lower than that of the Intel Pentium 3/4 processors or thecorresponding AMD Intel look-alikes. 
However, they have a number of facilities that put them ahead in the speed of floating-point oriented applications. Firstly, all RISC processors are able to deliver 2 or more 64-bit floating-point results in one clock cycle. Secondly, all of them feature out-of-order instruction execution, which enhances the number of instructions per cycle that can be processed (although the newer AMD processors also have 2-way floating-point instruction issuing and out-of-order execution, they are limited by their adherence to the Intel x86 instruction set). Thirdly, the bandwidth from the processor to the memory, in case of a cache miss, is larger than that of the Intel(-like) processors. Notwithstanding these commonalities between the various RISC processors, there are also differences in instruction latencies, number of instructions processed, etc., which we will address below. We provide block diagrams for each of the processors to give a schematic idea of their structure. However, these figures do not reflect the actual layout of the devices on the respective chips.</p>&processors;</vmp:section></vmp:chapter>