Linux and symmetric multiprocessor systems


Symmetric multiprocessor (SMP) systems are characterized by the following distinguishing features:

    the presence of two or more identical or similar processors in terms of characteristics;

    processors have access to shared memory, to which they are connected either through a common system backbone or through another mechanism for providing interaction, but in any case, the access time to memory resources from any processor is approximately the same;

    processors have access to common I/O facilities either through the same channel or through separate channels;

    all processors are capable of performing the same set of functions (hence the term symmetric system);

    the whole complex is controlled by a common operating system, which provides interaction between processors and programs at the level of tasks, files and data elements.

The first four features on this list hardly need further comment. The fifth shows the most important difference between SMP systems and cluster systems, in which components interact, as a rule, at the level of individual messages or complete files. In an SMP system, information can also be exchanged at the level of individual data items, so processes can interact more closely. The distribution of processes or threads among the individual processors is left to the operating system.

The most significant advantages of SMP systems over uniprocessor systems are as follows.

Higher performance. If individual application tasks can run in parallel, a system with multiple processors will run faster than a system with a single processor of the same type.

Reliability. Since all processors in an SMP system are of the same type and can perform the same tasks, if one of them fails, the task scheduled for it can be transferred to another processor. Therefore, the failure of one of the processors will not lead to a loss of operability of the entire system.

Incremental growth. The user can improve system performance by adding processors.

Scaling. A computer manufacturer can offer customers a range of systems with the same architecture but different cost and performance, differing only in the number of processors.

It should be noted that these advantages are potential rather than guaranteed and are far from always realized in practice.

A very attractive feature of SMP systems for users is their transparency: the operating system takes care of distributing tasks among the individual processors and synchronizing their work.

Figure 5.1 shows a generalized block diagram of a multiprocessor system.

Fig. 5.1. Generalized block diagram of a multiprocessor system

The system has two or more processors, each with the full set of necessary components: a control unit, ALU, registers and cache. Each processor has access to the system's main memory and I/O devices through some interconnection subsystem. Processors can exchange data and messages through main memory (a separate communication area is allocated in it for this purpose). The system may also support direct signal exchange between individual processors. Shared memory is often organized so that processors can access different blocks of it at the same time. In some systems, processors have local memory blocks and their own I/O channels in addition to the shared resources.

Variants of the structural organization of multiprocessor systems can be classified as follows:

    systems with a common (time-shared) bus;

    systems with multiport memory;

    systems with a central control unit.

5.2.1. Common bus systems

Using a common bus in time-sharing mode is the simplest way to organize the joint work of processors in an SMP system (Fig. 5.2). The bus structure and interface are practically the same as in a single-processor system. The bus comprises data, address and control lines. To support direct memory access by the I/O modules, the following mechanisms are provided.

Addressing. Modules on the bus are distinguished by address, so the source and destination of each transfer can be identified.

Arbitration. Any I/O module can temporarily become the bus master. The arbiter uses a priority mechanism to resolve conflicts between competing requests for control of the bus.

Time sharing. When one module gains control of the bus, the remaining modules are locked out and must, if necessary, suspend operation and wait until they are granted access to the bus.
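The arbitration and time-sharing rules above can be sketched as a toy fixed-priority arbiter. This is only an illustration: the class `BusArbiter`, the module names and the priority values are assumptions, and real bus arbiters are implemented in hardware and may use other policies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BusArbiter:
    """Toy fixed-priority arbiter: a lower number means a higher priority."""
    priorities: dict                       # module name -> priority (assumed values)
    pending: set = field(default_factory=set)
    master: Optional[str] = None           # module that currently owns the bus

    def request(self, module: str) -> None:
        """A module asks for the bus and waits until it is granted."""
        self.pending.add(module)

    def grant(self) -> Optional[str]:
        """If the bus is free, hand it to the highest-priority requester."""
        if self.master is None and self.pending:
            self.master = min(self.pending, key=lambda m: self.priorities[m])
            self.pending.remove(self.master)
        return self.master

    def release(self) -> None:
        """The current master finishes its transfer and frees the bus."""
        self.master = None


arbiter = BusArbiter(priorities={"cpu0": 1, "cpu1": 2, "dma_disk": 0})
for module in ("cpu1", "cpu0", "dma_disk"):
    arbiter.request(module)

order = []
while arbiter.grant() is not None:
    order.append(arbiter.master)   # all other modules stay blocked meanwhile
    arbiter.release()
print(order)  # ['dma_disk', 'cpu0', 'cpu1']
```

Requests are served strictly by priority, one bus master at a time, regardless of the order in which they arrived.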

These functions, which are common on single-processor systems, can be used on a multiprocessor system without much modification. The main difference is that not only I/O modules but also processors compete for access to memory.

Fig. 5.2. Organization of an SMP system with a common bus

The bus organization has several advantages over other approaches to implementing the interconnection subsystem.

Simplicity. This option is the simplest, since the physical interface, addressing scheme, arbitration mechanism and time-sharing logic remain essentially the same as in a uniprocessor system.

Flexibility. A bus-based system is fairly easy to reconfigure by adding new processors.

Reliability. The bus is a passive medium, and the failure of any device connected to it does not make the system as a whole inoperable.

The main disadvantage of a common bus system is limited performance. All accesses to main memory must pass through a single path, the common bus, so system performance is limited by the bus cycle time. This problem can be partly solved by equipping each processor with its own cache, which reduces the number of accesses to main memory. As a rule, a two-level cache organization is used: the L1 cache is located on the processor chip (internal cache), and the L2 cache is external.

However, the use of caches in a multiprocessor system creates the problem of cache coherence, that is, of keeping the contents of the individual processors' caches consistent.
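As a rough illustration of the bandwidth limit described above, the demand placed on a shared bus can be estimated as the product of the number of processors, the memory references each issues per bus cycle, the cache miss rate, and the bus cycles each transfer occupies. The sketch below is a back-of-the-envelope model, not a measurement; the function name and all numeric values are assumptions chosen for illustration.

```python
def bus_utilization(processors, refs_per_cycle, miss_rate, cycles_per_transfer):
    """Fraction of bus time demanded by all processors together.

    A result above 1.0 means the bus is saturated and processors must wait.
    """
    return processors * refs_per_cycle * miss_rate * cycles_per_transfer


# Assumed figures: 4 processors, 0.3 memory references per cycle each,
# 2 bus cycles per transfer.
without_cache = bus_utilization(4, 0.3, 1.0, 2)    # every reference uses the bus
with_cache = bus_utilization(4, 0.3, 0.05, 2)      # only 5% of references miss
print(f"{without_cache:.2f} {with_cache:.2f}")     # 2.40 0.12
```

Under these assumed numbers the bus would be oversubscribed more than twofold without caches, while a 95% hit rate brings the demand down to a small fraction of the bus capacity.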

5.2.2. Systems with multiport memory

Multiport memory in an SMP system gives each processor and I/O module direct access to the shared data, independently of all the others (Fig. 5.3). Each memory module must therefore be equipped with logic for resolving possible conflicts; most often, the ports are assigned fixed priorities. Typically, the electrical and physical interface at each port is the same as that of a single-port memory module, so each attached device sees the memory as an ordinary single-port module. Therefore, almost no changes need to be made to the processor or I/O module circuitry to connect it to multiport memory.

Fig. 5.3. Multiport memory diagram

Only the circuitry of the shared memory block becomes significantly more complicated, but this is repaid by higher performance of the system as a whole, since each processor has its own channel for accessing shared data. Another advantage of this organization is the ability to set aside areas of memory for the exclusive use of a particular processor (or group of processors). This simplifies protecting information from unauthorized access and storing recovery programs in memory areas that other processors cannot modify.

One more point is important when working with multiport memory. The caches must use a write-through policy: whenever a processor updates data in its cache, the update must also be written to main memory, since there is no other way to notify the other processors of the change.

5.2.3. Systems with a central control unit

The central control device organizes separate data flows between independent modules - processors, memory and input-output modules. The controller can remember requests and act as an arbiter and allocator of resources. It is also entrusted with the functions of transmitting information about the state, control messages and notifying processors about changes in information in the caches.

Since all logical functions related to the coordination of the system components are implemented in one central control unit, the interfaces of processors, memory and I/O modules remain practically unchanged. This provides the system with almost the same flexibility and simplicity as a common backbone. The main disadvantage of this approach is a significant complication of the control device circuit, which can potentially lead to a decrease in performance.

The structure with a central control unit was at one time widely used in multiprocessor computing systems based on large machines. Today such systems are very rare.

5.3. SMP-systems based on large computers

In most SMP systems for personal computers and workstations, a system bus is used to organize the interaction between components. Complexes based on large computers (mainframes) use an alternative approach. A block diagram of such a complex is shown in Fig. 5.4. The family includes computers of different classes, from single-processor machines with a single main memory board to high-performance systems with a dozen processors and four main memory blocks. The configuration also includes additional processors that act as I/O modules. The main components of computing systems based on large computers are as follows.

Processor (PR): a CISC microprocessor in which the most frequently used instructions are executed directly in hardware and the rest are executed by microcode. Each PR chip includes a 64 KB L1 cache that stores both instructions and data.

L2 cache, 384 KB in size. The L2 cache blocks are grouped into clusters of two, with each cluster serving three processors and providing access to the entire main memory address space.

Bus-switching network adapter (BSN), which connects the L2 cache blocks to one of the four main memory blocks. The BSN also includes a 2 MB L3 cache.

Single-board main memory block, 8 GB. The complex includes four such blocks, for a total of 32 GB of main memory.

There are several features in this structure that are worth dwelling on in more detail:

    switchable interconnection subsystem;

    shared L2 cache;

    L3 cache.

Fig. 5.4. Block diagram of an SMP system based on large machines

5.3.2. Switchable interconnection subsystem

In SMP systems for personal use and workstations, a structure with a single system backbone is common. In this variant, the backbone can eventually become a bottleneck that prevents further growth of the system - adding new components to it. Designers of SMP systems based on large machines have tried to deal with this problem in two ways.

First, they divided the main memory subsystem into four single-board blocks, equipping each of them with its own controller, which is capable of processing memory requests at high speed. As a result, the total throughput of the memory access channel has quadrupled.

Second, the link between each processor (in fact, its L2 cache) and a memory block is implemented not as a shared bus but as a point-to-point connection: each link connects a group of three processors, through their L2 cache, to a BSN. The BSN in turn acts as a switch that joins five links (four to L2 caches and one to a memory block), combining the four physical L2 links into one logical bus. A signal arriving on any of the four links connected to the L2 caches is thus repeated on the other three, which ensures cache coherence.

Although there are four separate memory blocks in the system, each processor and each L2 cache block has only two physical ports through which they communicate with the main memory subsystem. This solution was chosen because each L2 cache block can store data from only half of the entire memory address space. A pair of cache blocks is used to serve requests to the entire address space, and each processor must have access to both blocks in the pair.

5.3.3. Shared L2 cache blocks

In a typical SMP system structure, each processor has its own cache blocks (usually two levels). In recent years, the concept of sharing L2 cache blocks has attracted increasing interest from system designers. An early version of the SMP system based on a large machine used 12 L2 cache blocks, each of which was "owned" by one specific processor. In later versions, blocks of L2 caches are shared among multiple processors. This was done based on the following considerations.

The new versions use processors that are twice as fast as the processors of the first version. If at the same time the previous structure of cache blocks is left, the flow of information through the backbone subsystem will increase significantly. At the same time, the designers were tasked with making the most of the ready-made blocks designed for the old version. If the backbone subsystem is not upgraded, then the BSN units in the new version can become a bottleneck over time.

An analysis of typical applications running in the system showed that a fairly large part of commands and data is shared by different processors.

Therefore, the developers of the new version of the system considered sharing one or more L2 cache blocks among several processors (with each processor still having its own internal L1 cache). At first glance, the idea of sharing the L2 cache looks unattractive, since a processor must now compete for access to it, which may degrade performance. But if a significant portion of the data and instructions in the cache is needed by multiple processors, a shared cache can increase system throughput rather than reduce it. Data needed by multiple processors will be found in the shared cache faster than if it had to be passed through the bus subsystem.

The developers of the new version of the system also considered a single large cache shared by all processors in the system. Although this structure promised an even greater gain in performance, it had to be abandoned, since it required a complete redesign of the existing interconnection organization. An analysis of the data flows in the system showed that sharing the cache blocks associated with each of the existing BSNs would already give a very tangible increase in system performance. Compared with private caches, the cache hit rate rises significantly and, accordingly, the number of accesses to main memory falls.

5.3.4. L3 cache

Another feature of the SMP system based on a large machine is the inclusion of a third-level cache, L3. An L3 cache is included in each BSN and therefore acts as a buffer between the L2 caches and one of the main memory blocks. This cache reduces the delay in fetching data that is not in the L1 and L2 caches. If the data was previously required by any of the processors, it is present in the L3 cache and can be passed to the processor that now requests it. Retrieving data from the L3 cache takes less time than accessing the main memory block, which gives a performance gain.

Table 5.1 shows performance data for a typical SMP system based on the IBM S/390. The "access delay" indicator is the time the processor needs to retrieve data when it is present in a given element of the memory subsystem. When a processor requests new information, in 89% of cases it is found in the processor's own L1 cache. In the remaining 11% of cases, the caches of the next levels or main memory must be accessed. In 5% of cases the required information is found in the L2 cache, and so on. Main memory has to be accessed only 3% of the time. Without the L3 cache, this figure would be twice as high.

Table 5.1. Performance indicators of the elements of the memory subsystem in an SMP system based on the IBM S/390

Columns: memory subsystem element; access delay (processor cycles); hit percentage.
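The effect of the hit percentages quoted above can be made concrete by computing a weighted average access delay. In the sketch below the hit fractions follow the text (89% L1, 5% L2, 3% main memory, which leaves 3% for L3), while the delay values in cycles are assumed placeholders for illustration only; the function name is likewise hypothetical.

```python
def average_access_cycles(levels):
    """Weighted mean of access delays; `levels` is a list of (hit_fraction, delay_cycles)."""
    total_fraction = sum(fraction for fraction, _ in levels)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    return sum(fraction * delay for fraction, delay in levels)


# Hit fractions come from the text; the delays are assumed placeholder values.
L1, L2, L3, MEM = 1, 5, 14, 32
with_l3 = average_access_cycles([(0.89, L1), (0.05, L2), (0.03, L3), (0.03, MEM)])
# Without L3, its 3% of requests fall through to main memory (6% in total).
without_l3 = average_access_cycles([(0.89, L1), (0.05, L2), (0.06, MEM)])
print(f"{with_l3:.2f} vs {without_l3:.2f} cycles")  # 2.52 vs 3.06 cycles
```

Whatever the real delays are, the same calculation shows why halving the share of main memory accesses matters: main memory is the slowest level, so its share dominates the average.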

5.4. Cache coherence and the MESI protocol

In modern computing systems, it has become the norm to associate one or two levels of cache with each processor. This organization gives high system performance but creates the problem of cache coherence across processors. The essence of the problem is that copies of the same main memory data can be held in the caches of different processors. If one processor updates an item of such data in its own copy, the copies held by the other processors become stale, as does the content of main memory. Two policies can be used for propagating changes to main memory:

Write back. The processor changes only the contents of its cache. The contents of a line are written to main memory only when the modified cache line has to be evicted to make room for a new block of data.

Write through. All cache writes are immediately duplicated in main memory, without waiting until the corresponding cache line has to be replaced. As a result, main memory always holds the most recent, and therefore valid, data.

Clearly, the write-back policy can violate cache coherence: until the updated data is written back to main memory, the caches of other processors (and main memory itself) hold stale data. Even with write through, coherence is not guaranteed, since a change must be propagated not only to main memory but also to every cache that holds a copy of the data.
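The difference between the two policies, and the reason neither is sufficient on its own, can be seen in a small simulation. This is a minimal sketch under simplifying assumptions: the `Cache` class, the dictionary standing in for main memory and the address `0x10` are all hypothetical, and no coherence protocol is modelled.

```python
class Cache:
    """One-value-per-address toy cache in front of a shared `memory` dict."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.lines = {}        # address -> cached value
        self.dirty = set()     # addresses modified but not yet written back

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value         # duplicate the write immediately
        else:
            self.dirty.add(addr)              # defer until the line is evicted

    def evict(self, addr):
        if addr in self.dirty:                # write back only modified lines
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


def scenario(write_through):
    memory = {0x10: 1}
    cpu_a = Cache(memory, write_through)
    cpu_b = Cache(memory, write_through)
    cpu_a.read(0x10)
    cpu_b.read(0x10)             # both caches now hold a copy of the value 1
    cpu_a.write(0x10, 2)
    return memory[0x10], cpu_b.read(0x10)


print(scenario(write_through=False))  # (1, 1): memory and cpu_b are both stale
print(scenario(write_through=True))   # (2, 1): memory is current, cpu_b is stale
```

In both runs the second processor keeps reading its old copy, which is exactly the gap that a coherence protocol has to close.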

Massively parallel processing (MPP) is a class of architectures for parallel computing systems. Its distinguishing feature is that memory is physically distributed. The system is built from separate nodes, each containing a processor, a local bank of main memory, communication processors or network adapters, and sometimes hard disks and/or other I/O devices.

Only the processors of a given node have access to that node's memory bank. Nodes are connected by dedicated communication channels. The user can determine the logical number of the processor to which he or she is connected and organize the exchange of messages with other processors. MPP machines use one of two operating system arrangements:

● In one, the full-fledged OS runs only on the host machine, while each node runs a heavily stripped-down version of the OS that runs the branch of the parallel application that resides on it.

● In the second option, each module runs a complete, most often UNIX-like OS, which is installed separately.

SMP architecture

Amdahl's law (sometimes also called the Amdahl-Ware law) illustrates the limit on the performance growth of a computing system as the number of processors increases. Gene Amdahl formulated the law in 1967, having identified a limit on the performance gained by parallelizing computations that is simple in essence but insurmountable: "When a task is divided into several parts, its total execution time on a parallel system cannot be less than the execution time of the longest fragment." According to this law, the speedup obtained by parallelizing a program's instructions across a set of processors is limited by the time needed to execute its sequential instructions.

Mathematical expression

Suppose we need to solve some computational problem. Suppose that its algorithm is such that a fraction $\alpha$ of the total amount of computation can be carried out only sequentially and, accordingly, the fraction $1 - \alpha$ can be ideally parallelized (that is, its computation time is inversely proportional to the number of nodes involved). Then the speedup that can be obtained on a computing system of $p$ processors, compared with a single-processor solution, will not exceed the value

$$S_p = \frac{1}{\alpha + \dfrac{1 - \alpha}{p}}$$

The table shows how many times faster a program runs for a given fraction $\alpha$ of sequential computation when $p$ processors are used. For example, if half of the code is sequential, the overall speedup will never exceed two.
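The bound is easy to tabulate numerically. The sketch below uses the standard form of Amdahl's law, speedup = 1 / (α + (1 − α) / p), with α the sequential fraction and p the number of processors; the function name `amdahl_speedup` and the sample values are chosen here for illustration.

```python
def amdahl_speedup(alpha, p):
    """Upper bound on speedup with sequential fraction `alpha` on `p` processors."""
    if not 0.0 <= alpha <= 1.0 or p < 1:
        raise ValueError("need 0 <= alpha <= 1 and p >= 1")
    return 1.0 / (alpha + (1.0 - alpha) / p)


# With half of the work sequential, the speedup approaches 2 but never reaches it.
for p in (2, 10, 100, 1000):
    print(p, round(amdahl_speedup(0.5, p), 3))
```

Each tenfold increase in the number of processors buys less than the previous one, and the limiting value is 1/α.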

Conceptual significance

Amdahl's law shows that the gain in computational efficiency depends on the algorithm of the problem and is bounded from above for any problem with $\alpha \neq 0$. Increasing the number of processors in a computing system does not make sense for every task.

Moreover, if we take into account the time required for data transfer between the nodes of the computing system, the computation time as a function of the number of nodes will have a minimum. This limits the scalability of the computing system: beyond a certain point, adding new nodes to the system will increase the computation time of the task.
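One way to see this turning point is to add a communication term to the run-time model. The sketch below assumes, purely for illustration, that communication cost grows linearly with the number of nodes; the parameter `comm_cost`, the function names and the numeric values are hypothetical, and real systems may follow a different cost law.

```python
def run_time(alpha, p, comm_cost):
    """Normalised run time: serial part + parallel part + assumed linear communication cost."""
    return alpha + (1.0 - alpha) / p + comm_cost * (p - 1)


def best_node_count(alpha, comm_cost, max_nodes=1024):
    """Node count in 1..max_nodes with the smallest modelled run time."""
    return min(range(1, max_nodes + 1), key=lambda p: run_time(alpha, p, comm_cost))


best = best_node_count(alpha=0.1, comm_cost=0.001)
print(best)                                   # 30
print(run_time(0.1, best, 0.001) < run_time(0.1, 4 * best, 0.001))  # True
```

Under this model the run time falls until about 30 nodes and rises afterwards, so quadrupling the node count beyond that point makes the task slower, not faster.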

Fig. 1. a) the structure of a multiprocessor; b) the structure of a multicomputer.


In multiprocessors, the address space of all processors is the same. This means that if the same variable occurs in the programs of several processors of a multiprocessor, then these processors will access one physical cell of the shared memory to get or change the value of this variable. This circumstance has both positive and negative consequences.

On the one hand, there is no need to physically move data between communicating programs, which eliminates the time spent on interprocessor exchange.

On the other hand, since the simultaneous access of several processors to common data can lead to incorrect results, systems for synchronizing parallel processes and ensuring memory coherence are needed. Because processors need to access shared memory very frequently, the bandwidth requirements of the communication medium are extremely high.

The latter circumstance limits the number of processors in multiprocessors to a few dozen. The problem of access to shared memory can be partially eased by dividing the memory into blocks, which allows memory accesses from different processors to proceed in parallel.
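The need for synchronization when several processors update the same shared cell can be illustrated with threads, which share one address space much as the processors of a multiprocessor do. This is an analogy only: Python threads stand in for processors, and the variable names and iteration counts are assumptions.

```python
import threading

counter = 0                      # one shared cell visible to every thread
lock = threading.Lock()


def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:               # serialise the read-modify-write sequence
            counter += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

No data is copied between the workers, which is the advantage of a shared address space; the lock is the price paid for keeping concurrent updates correct.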

We note one more advantage of multiprocessors - a multiprocessor system operates under the control of a single copy of the operating system (usually UNIX-like) and does not require individual configuration of each processor node.

Homogeneous multiprocessors with equal (symmetrical) access to shared RAM are usually called SMP systems (systems with symmetrical multiprocessor architecture). SMP systems emerged as an alternative to expensive multiprocessor systems based on vector-pipeline processors and vector-parallel processors (see Fig. 2).


Due to the simplicity of their architecture, multicomputers are currently the most widely used. Multicomputers do not have shared memory. Therefore, interprocessor exchange in such systems is usually carried out through a communication network using message passing.

Each processor in a multicomputer has an independent address space. Therefore, a variable with the same name in the programs of different processors refers to physically different cells in each processor's own memory. This requires data to be physically moved between communicating programs on different processors. Most memory accesses by each processor are to its own memory, so the requirements on the communication medium are relaxed. As a result, the number of processors in multicomputer systems can reach several thousand, tens of thousands, and even hundreds of thousands.
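The contrast with shared memory can be sketched with a toy model in which every node owns a private dictionary and data moves only through explicit messages. The classes `Node` and `Network` and their methods are hypothetical illustrations, not the API of any real message-passing library.

```python
from collections import deque


class Node:
    """A processor with private memory; other nodes reach it only via messages."""

    def __init__(self, rank):
        self.rank = rank
        self.memory = {}          # independent address space of this node
        self.inbox = deque()      # messages delivered by the network


class Network:
    """Toy communication network connecting `size` nodes."""

    def __init__(self, size):
        self.nodes = [Node(rank) for rank in range(size)]

    def send(self, src, dst, name, value):
        """Copy a value from node `src` into the inbox of node `dst`."""
        self.nodes[dst].inbox.append((src, name, value))

    def receive(self, rank):
        """Take the oldest message and store its value in local memory."""
        src, name, value = self.nodes[rank].inbox.popleft()
        self.nodes[rank].memory[name] = value
        return src


net = Network(2)
net.nodes[0].memory["x"] = 1
net.nodes[1].memory["x"] = 100     # same name, physically different cell
net.nodes[0].memory["x"] += 1      # invisible to node 1
before = net.nodes[1].memory["x"]
net.send(0, 1, "x", net.nodes[0].memory["x"])
net.receive(1)
after = net.nodes[1].memory["x"]
print(before, after)  # 100 2
```

Node 1 sees the new value only after the explicit send and receive, which is the physical data movement that a multicomputer requires.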

The peak performance of the largest shared memory systems is lower than the peak performance of the largest distributed memory systems; the cost of systems with shared memory is higher than the cost of similar systems with distributed memory.

Homogeneous multicomputers with distributed memory are called computing systems with massively parallel architecture (MPP systems); see Fig. 2.

NUMA systems occupy an intermediate position between SMP systems and MPP systems.

Cluster systems (computing clusters)

Cluster systems (computing clusters) are a cheaper version of MPP systems. A computing cluster consists of a set of personal computers or workstations connected by a local network, which serves as the communication medium. Computing clusters are considered in detail later.

Fig. 2. Classification of multiprocessors and multicomputers.

