Linux and symmetric multiprocessor systems.

5. SYMMETRIC MULTIPROCESSOR SYSTEMS

5.1. Distinctive features and advantages of symmetric multiprocessor systems

The class of symmetric multiprocessor (SMP) systems is characterized by the following distinguishing features:

    the presence of two or more processors with identical or similar characteristics;

    the processors share access to a common memory, to which they are connected either through a common system backbone or through some other interconnection mechanism; in any case, the memory access time is approximately the same for every processor;

    processors have access to common I/O facilities either through the same channel or through separate channels;

    all processors are capable of performing the same set of functions (hence the term symmetric system);

    the whole complex is controlled by a common operating system, which provides interaction between processors and programs at the level of tasks, files and data elements.

The first four features in this list hardly need further comment. The fifth feature, however, marks the most important difference between SMP systems and cluster systems, in which interaction between components is carried out, as a rule, at the level of individual messages or complete files. In an SMP system, information can be exchanged between components even at the level of individual data items, so much closer interaction between processes can be organized. In an SMP system, the distribution of processes or task threads among individual processors is left to the operating system.

The most significant advantages of SMP systems over uniprocessor systems are as follows.

Performance increase. If individual parts of an application can run in parallel, a system with multiple processors will execute it faster than a system with a single processor of the same type.
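The gain from parallelism can be sketched in a few lines: one task is split into independent chunks whose partial results are combined at the end. This is only an illustrative sketch (the function name and chunk count are assumptions, not from the text); whether it actually runs faster depends on the OS scheduling the workers on different processors.

```python
# Minimal sketch: splitting one task into independent chunks that the
# OS may schedule on different processors. Names are illustrative.
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(data, workers=4):
    """Sum `data` by giving each worker an independent slice."""
    size = (len(data) + workers - 1) // workers
    slices = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, slices))
    return sum(partials)  # combine the partial results

total = chunk_sum(list(range(1000)))
print(total)  # -> 499500, same answer as a purely sequential sum
```

The result is identical to a sequential sum; only the scheduling of the chunks differs.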

Reliability. Since all processors in an SMP system are of the same type and can perform the same tasks, if one of them fails, the task scheduled for it can be transferred to another processor. Therefore, the failure of one of the processors will not lead to a loss of operability of the entire system.

Possibility of functional expansion. The user can improve system performance by including additional processors.

Production of a range of systems of different performance. A computer manufacturer can offer customers a family of systems with the same architecture but different cost and performance, differing in the number of processors.

It should be noted that all these advantages are most often potential and are far from always realized in practice.

A very attractive feature of SMP systems from the user's point of view is their transparency: the operating system takes care of distributing tasks among the individual processors and synchronizing their work.

5.2. Structural organization of SMP systems

Figure 5.1 shows a generalized block diagram of a multiprocessor system.

Fig. 5.1. Generalized scheme of a multiprocessor system

The system has two or more processors, each of which has the entire set of necessary nodes - the control unit, ALU, registers and cache block. Each processor has access to the system's main memory and input/output devices through some interaction subsystem. Processors can exchange data and messages through the main memory (for this, a separate communication area is allocated in it). In addition, the system may also support the possibility of direct signal exchange between individual processors. Often, shared memory is organized in such a way that processors can access different blocks of it at the same time. In some systems, processors have local memory blocks and their own I/O channels in addition to shared resources.

Variants of the structural organization of multiprocessor systems can be classified as follows:

    systems with a common, time-shared backbone;

    systems with multiport memory;

    systems with a central control unit.

5.2.1. Systems with a common backbone

Using a common backbone in time-sharing mode is the simplest way to organize the joint operation of processors in an SMP system (Fig. 5.2). The backbone structure and interface are practically the same as in a single-processor system. The backbone includes data, address and control-signal lines. To simplify direct memory access for the I/O modules, the following mechanisms are provided.

Addressing. Addressing is organized so that modules can be distinguished by address code when determining the sources and receivers of data.

Arbitration. Any I/O module can temporarily become a bus master. The arbiter uses some priority mechanism to resolve conflicts between competing requests for control of the backbone.

Time division. When one of the modules gains the right to control the backbone, the remaining modules are blocked and must, if necessary, suspend operations and wait until they are granted access to the backbone.

These functions, common in single-processor systems, can be used in a multiprocessor system without major modification. The main difference is that not only the I/O modules but also the processors compete for the right to access a memory block.
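A fixed-priority arbiter of the kind described above can be sketched as follows (the module names and priority values are illustrative assumptions):

```python
# Hedged sketch of a fixed-priority backbone arbiter: among competing
# requesters, the one with the lowest priority number wins the bus;
# the rest must wait for the next arbitration round.
def arbitrate(requests, priority):
    """Return the requester granted the bus, or None if none request it.

    requests: set of module names currently asserting a bus request
    priority: dict mapping module name -> priority (lower wins)
    """
    if not requests:
        return None
    return min(requests, key=lambda m: priority[m])

prio = {"cpu0": 0, "cpu1": 1, "dma": 2, "io": 3}
print(arbitrate({"dma", "cpu1"}, prio))  # cpu1 outranks dma -> cpu1
```

In a multiprocessor, the processors simply appear in the same request set as the I/O modules.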


Fig. 5.2. Organization of an SMP system with a common backbone

The backbone link structure has several advantages over other approaches to implementing the interaction subsystem.

Simplicity. This option is the simplest, since the physical interface, addressing scheme, arbitration mechanism, and backbone resource sharing logic remain essentially the same as in a uniprocessor system.

Flexibility. A backbone system is fairly easy to reconfigure with new processors.

Reliability. The backbone is a passive medium, and the failure of any device connected to it does not lead to a loss of system operability as a whole.

The main disadvantage of a common-backbone system is its limited performance. All accesses to main memory must pass along a single path, the shared backbone, so system performance is limited by the backbone cycle time. This problem can be partly solved by equipping each processor with its own cache, which reduces the number of main-memory accesses. As a rule, a two-level cache organization is used: the L1 cache is located on the processor chip (internal cache), while the L2 cache is external.
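The relief that per-processor caches give the shared backbone can be estimated with a back-of-the-envelope calculation; the hit rates and latencies below are assumed numbers, not figures from the text.

```python
# Sketch: average memory access time with an L1/L2 hierarchy. Only the
# misses of both cache levels have to cross the shared backbone.
def effective_access_time(h1, h2, t1, t2, t_mem):
    """h1, h2: fraction of ALL accesses satisfied by L1 / L2.
    t1, t2, t_mem: latencies (cycles) of L1, L2 and main memory."""
    miss = 1.0 - h1 - h2  # fraction that must use the backbone
    return h1 * t1 + h2 * t2 + miss * t_mem

# With 90% L1 hits and 8% L2 hits, only 2% of accesses load the bus:
print(round(effective_access_time(0.90, 0.08, 1, 5, 50), 2))  # -> 2.3
```

With these assumed numbers the average access costs about 2.3 cycles instead of 50, and backbone traffic drops to a fiftieth of the uncached case.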

However, the use of cache memory in a multiprocessor system creates the problem of consistency (information integrity) of the caches of the individual processors.

5.2.2. Systems with multiport memory

The use of multiport memory in SMP systems gives each processor and I/O module direct access to the common pool of information, independently of all the others (Fig. 5.3). Each memory module must then be equipped with logic for resolving possible conflicts; for this purpose, ports are most often assigned fixed priorities. The electrical and physical interface of each port is typically identical to that of a single-port memory module, so practically no changes are needed in a processor or I/O module to connect it to multiport memory.


Fig. 5.3. Multiport memory diagram

Only the scheme of the shared memory block becomes significantly more complicated, but this pays off with an increase in the performance of the system as a whole, since each processor has its own channel for accessing shared information. Another advantage of systems with this organization is the ability to allocate areas of memory for the exclusive use of a particular processor (or group of processors). This simplifies the creation of a system for protecting information from unauthorized access and storing recovery programs in memory areas inaccessible for modification by other processors.

There is one more important point when working with multiport memory: when updating information in the cache of any processor, write-through to main memory must be used, since there is no other way to notify the other processors of changes in the data.

5.2.3. Systems with a central control unit

The central control device organizes separate data flows between independent modules - processors, memory and input-output modules. The controller can remember requests and act as an arbiter and allocator of resources. It is also entrusted with the functions of transmitting information about the state, control messages and notifying processors about changes in information in the caches.

Since all logical functions related to the coordination of the system components are implemented in one central control unit, the interfaces of processors, memory and I/O modules remain practically unchanged. This provides the system with almost the same flexibility and simplicity as a common backbone. The main disadvantage of this approach is a significant complication of the control device circuit, which can potentially lead to a decrease in performance.

The structure with a central control unit was at one time widely used in multiprocessor computing systems based on large machines. Today it is encountered very rarely.

5.3. SMP systems based on large computers

5.3.1. Structure of SMP systems based on large computers

In most SMP systems for personal use and workstations, interaction between components is organized over a system backbone. Complexes based on large computers (mainframes) use an alternative approach. A block diagram of such a complex is shown in Fig. 5.4. The family includes computers of different classes, from single-processor machines with a single main memory board to high-performance systems with a dozen processors and four main memory blocks. The configuration also includes additional processors that act as I/O modules. The main components of computing systems based on large computers are as follows.

Processor (PR). A CISC microprocessor in which the most frequently used instructions are executed under hardwired control, while the rest are executed by firmware. The chip (LSI) of each PR includes a 64 KB L1 cache that stores both instructions and data.

L2 cache, 384 KB in size. The L2 cache blocks are grouped into clusters of two, with each cluster supporting three processors and providing access to the entire main memory address space.

Bus-switching network adapter (BSN), which connects the L2 cache blocks to one of the four main memory blocks. Each BSN also includes a 2 MB L3 cache.

Single-board main memory block of 8 GB. The complex includes four such blocks, giving a total main memory capacity of 32 GB.

There are several features in this structure that are worth dwelling on in more detail:

    switchable interconnection subsystem;

    shared L2 cache;

    L3 cache.


Fig. 5.4. Block diagram of an SMP system based on large machines

5.3.2. Switchable interconnection subsystem

In SMP systems for personal use and workstations, a structure with a single system backbone is common. In this variant, the backbone can eventually become a bottleneck that prevents further growth of the system - adding new components to it. Designers of SMP systems based on large machines have tried to deal with this problem in two ways.

First, they divided the main memory subsystem into four single-board blocks, equipping each of them with its own controller, which is capable of processing memory requests at high speed. As a result, the total throughput of the memory access channel has quadrupled.

Secondly, the link between each processor (more precisely, its L2 cache) and an individual memory block is implemented not as a shared backbone but as point-to-point connections: each link connects a group of three processors, through their L2 cache, to a BSN module. The BSN, in turn, acts as a switch that combines five communication channels (four to the L2 caches and one to a memory block) into a single logical data highway. Thus, a signal arriving on any of the four channels connected to the L2 caches is duplicated on the remaining three channels, which maintains the information integrity of the caches.

Although there are four separate memory blocks in the system, each processor and each L2 cache block has only two physical ports through which they communicate with the main memory subsystem. This solution was chosen because each L2 cache block can store data from only half of the entire memory address space. A pair of cache blocks is used to serve requests to the entire address space, and each processor must have access to both blocks in the pair.
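The address-space split between the two L2 blocks of a pair can be illustrated by a toy mapping; the selector bit is an assumption for illustration, since the text does not specify how the machine actually divides its addresses.

```python
# Illustrative sketch (an assumption, not the documented S/390 scheme):
# if each L2 block of a pair serves half of the address space, a single
# address bit can steer a request to the right block/port.
def l2_block_for(address, select_bit=12):
    """Pick which L2 block of the pair serves this address (0 or 1)."""
    return (address >> select_bit) & 1

print(l2_block_for(0x0000))  # bit 12 clear -> block 0
print(l2_block_for(0x1000))  # bit 12 set   -> block 1
```

Whatever the real mapping, each processor needs a port to both blocks, because a single block covers only half of the addresses it may request.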

5.3.3. Shared L2 cache blocks

In a typical SMP system structure, each processor has its own cache blocks (usually two levels). In recent years, the concept of sharing L2 cache blocks has attracted increasing interest from system designers. An early version of the SMP system based on a large machine used 12 L2 cache blocks, each of which was "owned" by one specific processor. In later versions, blocks of L2 caches are shared among multiple processors. This was done based on the following considerations.

The new versions use processors that are twice as fast as those of the first version. If the previous cache structure were retained, the flow of information through the backbone subsystem would increase significantly, and the BSN units would become a bottleneck over time if the backbone subsystem were not upgraded. At the same time, the designers were required to make maximum use of the ready-made blocks designed for the old version.

An analysis of typical applications running in the system showed that a fairly large part of commands and data is shared by different processors.

Therefore, the developers of the new version of the system considered sharing one or more L2 cache blocks among several processors (each processor still having its own internal L1 cache). At first sight the idea of sharing the L2 cache looks unattractive, since a processor must now compete for access to it, which may degrade performance. But if a significant portion of the data and instructions in the cache is needed by several processors, a shared cache can increase system throughput rather than reduce it: data needed by several processors is found in the shared cache faster than if it had to be passed through the backbone subsystems.

The developers of the new version of the system also considered including a single large cache shared by all processors. Although this structure promised an even greater performance increase, it had to be abandoned, since it would have required a complete redesign of the entire existing communication organization. Analysis of the data flows in the system showed that sharing the cache blocks associated with each of the existing BSNs already yields a very tangible increase in system performance. At the same time, compared with per-processor caches, the cache hit rate increases significantly and the number of main memory accesses decreases accordingly.

5.3.4. L3 cache

Another feature of the SMP system based on a large machine is the inclusion of a third-level (L3) cache in its structure. The L3 cache is part of each BSN and thus acts as a buffer between the L2 caches and one of the main memory blocks. It reduces the delay in obtaining data that is absent from the L1 and L2 caches: if that data was previously required by any of the processors, it is present in the L3 cache and can be passed to the requesting processor. Retrieving data from the L3 cache takes less time than accessing a main memory block, which yields a performance gain.

Table 5.1 shows performance data for a typical SMP system based on the IBM S/390. The "access delay" column characterizes the time the processor needs to retrieve data present in a given element of the memory subsystem. When a processor requests new information, in 89% of cases it is found in its own L1 cache. In the remaining 11% of cases the caches of the next levels or main memory must be accessed: in 5% of cases the information is found in the L2 cache, and so on. Only in 3% of cases does the request finally reach main memory. Without the third-level cache this last figure would be twice as high.

Table 5.1

Efficiency indicators of the elements of the memory subsystem in an SMP system based on IBM S/390

Memory subsystem     Access delay (processor work cycles)     Hit percentage
L1 cache                                                      89
L2 cache                                                      5
L3 cache                                                      3
Main memory                                                   3
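The hit percentages quoted above (89/5/3/3) can be combined with per-level latencies into an average access delay. The cycle counts used here are assumed placeholders, since the original delay figures did not survive in the table.

```python
# Rough sketch of the calculation behind Table 5.1: the average access
# delay is the hit-rate-weighted sum of the level latencies.
# Hit percentages come from the text; cycle counts are assumptions.
hits = {"L1": 0.89, "L2": 0.05, "L3": 0.03, "memory": 0.03}
cycles = {"L1": 1, "L2": 5, "L3": 15, "memory": 50}  # assumed latencies

avg = sum(hits[level] * cycles[level] for level in hits)
print(round(avg, 2))  # -> 3.09 cycles with these assumed latencies
```

Even with 3% of requests paying the full main-memory penalty, the weighted average stays close to the cache latencies, which is the whole point of the hierarchy.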

5.4. Information integrity of caches and the MESI protocol

5.4.1. Ways to solve the problem of information integrity

In modern computing systems it has become the norm to use one or two levels of cache associated with each processor. Such an organization achieves high system performance but creates the problem of data integrity in the caches of different processors. Its essence is that copies of the same data from main memory may be stored in the caches of different processors. If some processor then updates an element of this data in its copy, the copies held by the other processors become invalid, as does the content of main memory. Two techniques can be used to propagate the changes made to main memory:

Write back. In this variant the processor makes changes only in the contents of its own cache. The contents of a line are written to main memory only when the modified cache line must be cleared to receive a new block of data.

Write through. All cache writes are immediately duplicated in main memory, without waiting for the corresponding cache line to be replaced. As a result, main memory can always be assumed to hold the most recent, and therefore valid, information.

Obviously, the write-back technique can violate the information integrity of the cached data, since until the updated data is written back to main memory, the caches of other processors contain stale data. Even the write-through technique does not guarantee integrity, since the changes must be duplicated not only in main memory but also in all cache blocks holding copies of this data.
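A toy model (assumed class and variable names, a single cache) makes the difference concrete: with write-back, main memory stays stale until the line is evicted, while write-through updates memory on every write.

```python
# Toy single-line cache contrasting the two write policies described
# above. All names here are illustrative assumptions.
class Cache:
    def __init__(self, memory, write_through):
        self.memory = memory            # shared main memory (a dict)
        self.line = {}                  # cached copies
        self.write_through = write_through

    def write(self, addr, value):
        self.line[addr] = value
        if self.write_through:
            self.memory[addr] = value   # duplicate immediately

    def evict(self, addr):
        self.memory[addr] = self.line.pop(addr)  # write back on eviction

mem = {0x10: 1}
wb = Cache(mem, write_through=False)
wb.write(0x10, 2)
print(mem[0x10])   # -> 1: with write-back, memory is stale until eviction
wb.evict(0x10)
print(mem[0x10])   # -> 2: the update reaches memory only now
```

Note that even the write-through variant only keeps main memory current; invalidating copies in other processors' caches is a separate problem, which coherence protocols such as MESI address.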


SMP architecture is a symmetrical multiprocessor architecture. The main feature of systems with SMP architecture is the presence of a common physical memory shared by all processors.

The SMP system is built around a high-speed system bus, to whose slots three types of functional blocks are connected:

● processors (CPU),

● main memory (RAM),

● an input/output (I/O) subsystem.

Memory serves, among other things, as a medium for passing messages between processors. All computing devices have equal rights when accessing main memory and use the same addressing for all memory cells, which makes it possible to exchange data with other computing devices very efficiently. The SMP system runs under a single operating system (either UNIX-like or Windows). The OS automatically distributes processes across processors, although explicit binding is also possible. The SMP architecture is used in servers and workstations based on processors from Intel, AMD, Sun, IBM, HP, etc.

Organization principles:

An SMP system consists of several homogeneous processors and an array of shared memory. Each memory access operation is interpreted as a transaction on the processor-memory bus. "Peer" means that each processor can do everything that any other can: each processor has access to all of memory, can perform any I/O operation, interrupt other processors, and so on. In SMP, each processor has at least one private cache.

Coherence of caches is supported by hardware.

Advantages:

· Simplicity and versatility for programming. The SMP architecture does not impose restrictions on the programming model used when creating an application: the parallel branches model is usually used, when all processors work absolutely independently of each other - however, models that use interprocessor exchange can also be implemented. The use of shared memory increases the speed of such an exchange, the user also has access to the entire amount of memory at once.

· Ease of use. As a rule, SMP systems use an air-cooling system, which makes them easier to maintain.

· Relatively low price.

· The implicit transfer of data between caches by the SMP hardware is the fastest and cheapest means of communication in any parallel general purpose architecture.

· Readiness. In a symmetric multiprocessor, the failure of one of the components does not lead to the failure of the system, since any of the processors is able to perform the same functions as the others.

Flaws:

SMP systems do not scale well:

1. The system bus has a limited bandwidth and a limited number of slots.

2. The bus can only process one transaction at a time, which results in conflict resolution problems when multiple processors access the same areas of shared physical memory at the same time.

In real systems, no more than 8-16-32 processors can be effectively used.

Application:

SMP is often used in science, industry, and business, where software is specifically designed for multi-threaded execution. At the same time, most consumer products, such as text editors and computer games, are written in such a way that they cannot get much benefit from SMP systems. In the case of games, this is often due to the fact that optimizing the program for SMP systems will lead to performance loss when working on single-processor systems, which occupy a large part of the market.

Examples of computers with SMP architecture:

HP 9000 (up to 32 processors), Sun HPC 10000 (up to 64 processors), Compaq AlphaServer (up to 32 processors), Sun SPARC Enterprise T5220
2.8. MPP architecture. History of development. Basic principles. The concept, architecture and characteristics of the Intel Paragon supercomputer.

Massively parallel processing (MPP) is a class of parallel computing system architectures. The distinctive feature of the architecture is that memory is physically distributed. The system is built from separate nodes containing a processor, a local bank of main memory (RAM), communication processors or network adapters, and sometimes hard disks and/or other input/output devices.

Only the processors of a given node have access to that node's RAM bank. Nodes are connected by special communication channels. A user can determine the logical number of the processor he is connected to and organize message exchange with other processors. MPP machines use two operating system options:

● In one, the full-fledged OS runs only on the host machine, while each node runs a heavily stripped-down version of the OS that runs the branch of the parallel application that resides on it.

● In the second option, each module runs a complete, most often UNIX-like OS, which is installed separately.


SMP architecture

SMP architecture (symmetric multiprocessing) - symmetric multiprocessing architecture. The main feature of systems with SMP architecture is the presence of a common physical memory shared by all processors.

Memory is a way of transferring messages between processors, while all computing devices, when accessing it, have equal rights and the same addressing for all memory cells. Therefore, the SMP architecture is called symmetrical.

The main advantages of SMP systems:

Simplicity and versatility for programming. The SMP architecture does not impose restrictions on the programming model used when creating an application: the parallel branches model is usually used, when all processors work absolutely independently of each other - however, models that use interprocessor exchange can also be implemented. The use of shared memory increases the speed of such an exchange, the user also has access to the entire amount of memory at once. For SMP systems, there are relatively efficient means of automatic parallelization.

Ease of use. Typically, SMP systems use an air-cooling system, which facilitates their maintenance.

Relatively low price.

Flaws:

Shared memory systems built on a system bus do not scale well. This important drawback of SMP systems does not allow them to be considered truly promising. The reason for the poor scalability is that the bus can process only one transaction at a time, which causes conflict resolution problems when several processors access the same areas of shared physical memory simultaneously.

MPP architecture

MPP architecture (massive parallel processing) is a massively parallel architecture. The main feature of this architecture is that memory is physically distributed. The system is built from separate modules containing a processor, a local bank of main memory (RAM), communication processors (routers) or a network adapter, and sometimes hard drives and/or other input/output devices.

Main advantage:

The main advantage of distributed-memory systems is good scalability: unlike SMP systems, each processor in such machines has access only to its own local memory, so there is no need for cycle-by-cycle synchronization of the processors. Almost all performance records today are set on machines of exactly this architecture, consisting of several thousand processors (ASCI Red, ASCI Blue Pacific).

Flaws:

The absence of shared memory significantly reduces the speed of interprocessor exchange, since there is no common medium for storing the data to be exchanged between processors; a special programming technique is required to implement message passing between processors.
Each processor can use only the limited amount of its local memory bank.
Because of these architectural drawbacks, considerable effort is required to make maximal use of system resources, and this is what determines the high price of software for massively parallel distributed-memory systems.

PVP architecture

PVP (parallel vector processing) is a parallel architecture with vector processors.
The main feature of PVP systems is the presence of special vector pipeline processors, which provide instructions for uniform processing of vectors of independent data, executed efficiently on pipelined functional units. As a rule, several such processors (1-16) work simultaneously with shared memory (similar to SMP) in multiprocessor configurations. Several such nodes can be combined with a switch (similar to MPP). Since data transfer in vector format is much faster than in scalar format (the maximum speed can reach 64 GB/s, two orders of magnitude faster than in scalar machines), the problem of interaction between data streams during parallelization becomes insignificant; what parallelizes badly on scalar machines parallelizes well on vector ones. Thus, PVP-architecture systems can be general-purpose machines. However, since vector processors are very expensive, such machines cannot be generally affordable.

Cluster architecture

A cluster is two or more computers (often called nodes) connected using network technologies based on a bus architecture or a switch and appearing to users as a single information and computing resource. Servers, workstations and even ordinary personal computers can be selected as cluster nodes. The uptime benefit of clustering becomes apparent in the event of a node failure: another node in the cluster can take over the load of the failed node without users noticing the interruption in access.

Amdahl's law (sometimes also the Amdahl-Ware law) illustrates the limit on the growth of computing system performance as the number of processors increases. Gene Amdahl formulated the law in 1967, having discovered a limitation on performance growth in parallel computing that is simple in essence but insurmountable in its consequences: "When a task is divided into several parts, the total time of its execution on a parallel system cannot be less than the execution time of the longest fragment." According to this law, the speedup of program execution from parallelizing its instructions over a set of processors is limited by the time required to execute its sequential instructions.

Mathematical expression

Suppose we need to solve some computational problem, and suppose its algorithm is such that a share f of the total volume of computation can be obtained only by sequential calculation, while the remaining share 1 - f can be parallelized ideally (that is, its computation time is inversely proportional to the number of nodes involved). Then the speedup that can be obtained on a computing system of p processors, compared to a single-processor solution, will not exceed

    S_p = 1 / (f + (1 - f) / p)

Illustration

For a given share f of sequential calculations, Amdahl's law shows how many times faster the program will run on p processors. For example, if half of the code is sequential, the total gain will never exceed two, no matter how many processors are used.
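The bound is easy to evaluate numerically; a minimal sketch of the formula above, with f as the sequential share and p as the number of processors:

```python
# Amdahl's law: speedup bound for sequential share f on p processors.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

print(amdahl_speedup(0.5, 4))      # -> 1.6, i.e. 1/(0.5 + 0.5/4)
print(amdahl_speedup(0.5, 10**9))  # approaches the 1/f = 2 ceiling
```

Even with a billion processors, a half-sequential program never quite reaches a twofold speedup.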

Ideological value

Amdahl's law shows that the gain in computational efficiency depends on the algorithm of the problem and, for any problem with a nonzero sequential share f, is bounded from above by 1/f. It therefore does not make sense to increase the number of processors in a computer system for every task.

Moreover, if we take into account the time required for data transfer between the nodes of the computing system, the dependence of computation time on the number of nodes has a minimum. This limits the scalability of the computing system: from a certain point, adding new nodes to the system increases the time needed to solve the task.
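This minimum can be seen in a hedged extension of the remark above: if each extra node adds a fixed communication cost c, the total time t(p) = f + (1 - f)/p + c·p first falls and then rises. All constants here are illustrative assumptions.

```python
# Sketch: run time with sequential share f, parallel share (1-f)/p and
# an assumed per-node communication overhead c*p.
def run_time(p, f=0.05, c=0.001):
    return f + (1.0 - f) / p + c * p

times = {p: run_time(p) for p in (1, 8, 32, 128, 512)}
best = min(times, key=times.get)
print(best)  # -> 32: the optimum lies in the middle, not at the largest p
```

Past the optimum node count, the communication term dominates and extra nodes only slow the computation down.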

5.2. Symmetric SMP architecture

SMP (symmetric multiprocessing) is a symmetric multiprocessor architecture. The main feature of systems with the SMP architecture (Fig. 5.5) is the presence of a common physical memory shared by all processors.

Figure 5.5 - Schematic view of the SMP architecture

Memory is used, in particular, to transfer messages between processors, while all computing devices, when accessing it, have equal rights and the same addressing for all memory cells. Therefore, the SMP architecture is called symmetrical. The latter circumstance allows very efficient data exchange with other computing devices.

The SMP system is built around a high-speed system bus (SGI PowerPath, Sun Gigaplane, DEC TurboLaser), to whose slots functional blocks of several types are connected: processors (CPU), the input/output (I/O) subsystem, etc. Slower buses (PCI, VME64) are used to connect the I/O modules.

The best-known SMP systems are SMP servers and workstations based on Intel processors (IBM, HP, Compaq, Dell, ALR, Unisys, DG, Fujitsu, etc.). The entire system runs under a single OS (usually UNIX-like, but Windows NT is supported on Intel platforms). The OS automatically (at run time) distributes processes among the processors, but explicit binding is sometimes possible.

The main advantages of SMP systems:

- simplicity and versatility for programming. The SMP architecture does not impose restrictions on the programming model used when creating an application: a parallel branch model is usually used, when all processors work independently of each other. However, models that use interprocessor exchange can also be implemented. The use of shared memory increases the speed of such an exchange, the user also has access to the entire amount of memory at once. For SMP systems, there are quite efficient means of automatic parallelization;

- ease of operation. As a rule, SMP systems use air cooling, which makes them easier to maintain;

- relatively low price.
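The parallel-branch model mentioned among the advantages can be sketched with Python threads (a hedged illustration; production SMP codes more often use OpenMP or POSIX threads): each branch works on its own slice of one shared array, and no explicit data transfer is needed because all branches see the same memory.

```python
import threading

# A shared array visible to all threads -- the analogue of shared physical memory.
N = 8
data = [0] * N

def branch(lo, hi):
    """One parallel branch: fills its own slice of the shared array."""
    for i in range(lo, hi):
        data[i] = i * i          # no message passing needed

# Four branches, each covering a quarter of the array.
threads = [threading.Thread(target=branch, args=(k, k + N // 4))
           for k in range(0, N, N // 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# data now holds the squares 0..49
```

Because each branch writes a disjoint slice, no synchronization is needed here; overlapping writes would require locks, as discussed below for multiprocessors in general.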

Flaws:

- shared memory systems do not scale well.

This significant drawback of SMP systems prevents them from being considered truly promising. Scalability is poor because, at any given time, the bus can process only one transaction, which creates conflict-resolution problems when several processors access the same areas of shared physical memory simultaneously.

Currently, such conflicts become noticeable with 8-24 processors, which clearly hinders performance growth as the number of processors and connected users increases. In real systems, no more than 32 processors can be used. To build scalable systems on an SMP basis, cluster or NUMA architectures are used. Programming for SMP systems follows the so-called shared-memory paradigm.

For MIMD systems there is now a generally accepted classification based on how RAM is organized in these systems. It distinguishes, first of all, multiprocessor computing systems (multiprocessors, or computing systems with shared memory: common memory systems, shared-memory systems) and multicomputer computing systems (multicomputers, or computing systems with distributed memory: distributed memory systems). The structures of a multiprocessor and a multicomputer are shown in Fig. 1, where P denotes a processor and M a memory module.

Fig. 1. a) structure of a multiprocessor; b) structure of a multicomputer.

Multiprocessors.

In multiprocessors, the address space of all processors is the same. This means that if the same variable occurs in the programs of several processors of a multiprocessor, then these processors will access one physical cell of the shared memory to get or change the value of this variable. This circumstance has both positive and negative consequences.

On the one hand, there is no need to physically move data between interacting programs, which eliminates the time spent on interprocessor exchange.

On the other hand, since the simultaneous access of several processors to common data can lead to incorrect results, systems for synchronizing parallel processes and ensuring memory coherence are needed. Because processors need to access shared memory very frequently, the bandwidth requirements of the communication medium are extremely high.
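The need for synchronization can be illustrated with a shared counter (a sketch in Python; in real systems this role is played by hardware atomic instructions and OS primitives): several threads perform a read-modify-write on the same variable, and a lock serializes that sequence so no update is lost.

```python
import threading

counter = 0                      # shared variable in common memory
lock = threading.Lock()

def add(times):
    """Each parallel process increments the shared counter."""
    global counter
    for _ in range(times):
        with lock:               # serialize the read-modify-write sequence
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000 is guaranteed only because the lock prevents
# two threads from interleaving inside `counter += 1`
```

Without the lock, two threads could both read the old value and write back the same incremented result, losing one update.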

The latter circumstance limits the number of processors in multiprocessors to a few dozen. The severity of the shared-memory access problem can be partially reduced by dividing memory into banks, which allows accesses from different processors to proceed in parallel.
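Dividing memory into banks is usually done by interleaving: consecutive addresses cycle through the banks, so processors streaming through memory at different offsets mostly hit different banks. A toy sketch (the bank count is illustrative):

```python
NUM_BANKS = 4  # illustrative number of memory banks

def bank_of(address):
    """Low-order interleaving: consecutive word addresses cycle through banks."""
    return address % NUM_BANKS

# Two processors streaming through memory from different starting addresses:
# at every step they land in different banks, so their accesses can be
# served in parallel instead of colliding on one memory module.
cpu0_banks = [bank_of(a) for a in range(0, 8)]      # [0, 1, 2, 3, 0, 1, 2, 3]
cpu1_banks = [bank_of(a) for a in range(101, 109)]  # [1, 2, 3, 0, 1, 2, 3, 0]
```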

We note one more advantage of multiprocessors: a multiprocessor system runs under a single copy of the operating system (usually UNIX-like) and does not require individual configuration of each processor node.

Homogeneous multiprocessors with equal (symmetrical) access to shared RAM are usually called SMP systems (systems with symmetrical multiprocessor architecture). SMP systems emerged as an alternative to expensive multiprocessor systems based on vector-pipeline processors and vector-parallel processors (see Fig. 2).


Multicomputers.

Due to the simplicity of their architecture, multicomputers are currently the most widely used. Multicomputers do not have shared memory. Therefore, interprocessor exchange in such systems is usually carried out through a communication network using message passing.

Each processor in a multicomputer has an independent address space. Consequently, a variable with the same name in the programs of different processors refers to physically different cells in the local memory of those processors, so data must be physically moved between interacting programs on different processors. Since each processor makes most of its accesses to its own local memory, the requirements on the communication medium are relaxed. As a result, the number of processors in multicomputer systems can reach several thousand, tens of thousands, and even hundreds of thousands.

The peak performance of the largest shared-memory systems is lower than that of the largest distributed-memory systems, while the cost of shared-memory systems is higher than that of comparable distributed-memory systems.

Homogeneous multicomputers with distributed memory are called computing systems with a massively parallel architecture (MPP systems) - see Fig. 2.

NUMA systems occupy an intermediate position between SMP systems and MPP systems.


Cluster systems (computing clusters).

Cluster systems (computing clusters) are a cheaper alternative to MPP systems. A computing cluster consists of a set of personal computers or workstations connected by a local network serving as the communication medium. Computing clusters are considered in detail later.

Fig. 2. Classification of multiprocessors and multicomputers.

SMP systems

All processors in an SMP system have symmetric access to memory; the memory of an SMP system is UMA memory. Symmetry means the following: all processors have equal rights of access to memory; addressing is the same for all memory elements; and all processors in the system have the same memory access time (ignoring contention).

The general structure of the SMP system is shown in fig. 3. The communication environment of an SMP system is based on some kind of high-speed system bus or high-speed switch. In addition to identical processors and shared memory M, I/O devices are connected to the same bus or switch.

Behind the apparent simplicity of SMP systems lie significant problems, associated mainly with RAM. The fact is that the speed of RAM currently lags far behind the speed of the processor. To bridge this gap, modern processors are equipped with high-speed buffer memory (cache memory), access to which is several tens of times faster than access to main memory. However, the presence of cache memory violates the principle of equal access to any point of memory, since data in the cache of one processor is not available to the other processors. Therefore, after each modification of a copy of a variable held in the cache of some processor, the variable itself in main memory must be updated synchronously. In modern SMP systems, cache coherence is maintained by hardware or by the operating system.

Fig. 3. General structure of an SMP system
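The cache-coherence mechanism just described can be modeled as a toy write-invalidate protocol (a drastically simplified sketch of the idea behind hardware schemes such as MESI, not any real protocol): a write by one processor updates main memory and invalidates the copies cached by the others, forcing them to re-fetch.

```python
class CoherentCaches:
    """Toy write-invalidate coherence protocol over one shared variable."""

    def __init__(self, num_cpus):
        self.memory = 0                      # main (shared) memory
        self.cache = [None] * num_cpus       # None = not cached / invalidated

    def read(self, cpu):
        if self.cache[cpu] is None:          # miss: fetch from main memory
            self.cache[cpu] = self.memory
        return self.cache[cpu]               # hit: served from the cache

    def write(self, cpu, value):
        self.cache[cpu] = value
        self.memory = value                  # write-through to main memory
        for other in range(len(self.cache)):
            if other != cpu:
                self.cache[other] = None     # invalidate other copies

caches = CoherentCaches(num_cpus=2)
caches.read(0)           # CPU 0 caches the value 0
caches.write(1, 42)      # CPU 1 writes: CPU 0's copy is invalidated
value = caches.read(0)   # CPU 0 misses, re-fetches, and sees 42
```

Without the invalidation step, CPU 0 would keep reading its stale cached 0, which is precisely the coherence violation the hardware must prevent.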

The most well-known SMP systems are SMP servers and workstations from IBM, HP, Compaq, Dell, Fujitsu, and others. An SMP system operates under a single operating system (most often UNIX and the like).

Due to the limited bandwidth of the communication environment, SMP systems do not scale well. Currently, real systems use no more than a few dozen processors.

A well-known disadvantage of SMP systems is that their cost increases faster than performance as the number of processors in the system increases.

MPP systems.

MPP systems are built from processor nodes, each containing a processor, local RAM, a communication processor or network adapter, and sometimes hard drives and/or other input/output devices. In effect, such modules are full-featured computers (see Fig. 4). Only the processor of a given module has access to that module's RAM; modules interact with each other through some communication medium. There are two variants of operating system organization in MPP systems. In the first, a complete operating system runs only on the host computer, while each individual module runs a heavily stripped-down version supporting only the basic kernel functions. In the second, a full-fledged UNIX-like operating system runs on every module. Note that the need to keep an operating system (in one form or another) on every processor of an MPP system leaves only a limited amount of memory available on each processor.

Compared to SMP systems, the architecture of the MPP system eliminates both the memory contention problem and the cache coherency problem at the same time.

The main advantage of MPP systems is good scalability. For example, supercomputers of the CRAY T3E series can scale up to 2048 processors. Almost all performance records to date have been set on MPP systems consisting of several thousand processors.

Fig. 4. General structure of an MPP system.

On the other hand, the lack of shared memory significantly reduces the speed of interprocessor exchange in MPP systems. This circumstance for MPP systems brings to the fore the problem of the efficiency of the communication medium.

In addition, MPP systems require a special programming technique to implement communication between processors. This explains the high price of software for MPP systems, and also why writing efficient parallel programs for MPP systems is harder than writing the same programs for SMP systems. For a wide range of problems with well-established sequential algorithms, it has not proved possible to construct efficient parallel algorithms for MPP systems.

NUMA systems.

Logically shared data access can also be provided with physically distributed memory. In this case, the distance between different processors and different memory elements is, generally speaking, different, and so is the access time of different processors to different memory elements. That is, the memory of such systems is NUMA memory.

A NUMA system is usually built from homogeneous processor nodes, each consisting of a small number of processors and a block of memory. The modules are connected by some high-speed communication medium (see Fig. 5). A single address space is supported, and access to remote memory, i.e. the memory of other modules, is supported in hardware. At the same time, access to local memory is several times faster than access to remote memory. In essence, a NUMA system is an MPP system in which SMP nodes are used as the individual computing elements.
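The local/remote asymmetry can be summarized by a simple average-access-time formula, t_eff = p * t_local + (1 - p) * t_remote, where p is the fraction of accesses that hit local memory. A sketch with illustrative latencies (the numbers below are made up for the example, not measurements of any real machine):

```python
def effective_access_time(p_local, t_local=100.0, t_remote=400.0):
    """Average memory access time (ns) for a given fraction of local accesses.

    The latencies are illustrative: remote access is assumed to be
    several times slower than local access, as in a typical NUMA node.
    """
    return p_local * t_local + (1.0 - p_local) * t_remote

# Good data placement (90% of accesses local) vs. poor placement (50% local):
good = effective_access_time(0.9)   # 130.0 ns
poor = effective_access_time(0.5)   # 250.0 ns
```

The formula makes clear why data placement matters so much on NUMA systems: the same program runs markedly faster when most of its accesses stay within the local module.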

Among NUMA systems, the following types of systems are distinguished:

  • COMA systems, in which only the local cache memory of processors is used as RAM (cache-only memory architecture - COMA);
  • CC-NUMA-systems, in which the hardware ensures the coherence of the local cache memory of different processors (cache-coherent NUMA - CC-NUMA);
  • NCC-NUMA systems, in which the hardware does not support the coherence of the local cache memory of different processors (non-cache coherent NUMA - NCC-NUMA). This type includes, for example, the Cray T3E system.

Fig. 5. General structure of a NUMA system.

The logical shareability of memory in NUMA systems allows, on the one hand, working with a single address space and, on the other hand, achieving high system scalability in a simple way. This technology currently makes it possible to create systems containing up to several hundred processors.

NUMA systems are mass-produced by many computer companies as multiprocessor servers and firmly hold the lead in the class of small supercomputers.