How does a computer processor work? Principle of operation, design, and purpose: what devices does the processor consist of?

The processor is the main computer device that performs logical and arithmetic operations and controls all computer components. The processor is a miniature thin rectangular silicon wafer on which a huge number of transistors are placed that implement all the functions performed by the processor. The silicon wafer is very fragile, and since any damage to it will lead to failure of the processor, it is placed in a plastic or ceramic case.

1. Introduction
2. Processor core
2.1. How the processor core works
2.2. Ways to improve processor core performance
2.2.1. Pipelining
2.2.2. Superscalarity
2.2.3. Parallel data processing
2.2.4. Hyper-Threading technology
2.2.5. Turbo Boost technology
2.2.6. Efficiency of command execution
2.3. Ways to reduce processor core power consumption
3. Cache memory

1. Introduction.

A modern processor is a complex and high-tech device that includes all the latest achievements in the field of computing and related fields of science.

Most modern processors consist of:

    one or more cores that execute all instructions;

    several levels of cache memory (usually two or three), which speed up the processor's interaction with RAM;

    RAM controller;

    system bus controller (DMI, QPI, HT, etc.);

And is characterized by the following parameters:

    type of microarchitecture;

    clock frequency;

    set of commands to be executed;

    the number of cache memory levels and their volume;

    system bus type and speed;

    size of processed words;

    the presence or absence of a built-in memory controller;

    type of supported RAM;

    addressable memory volume;

    the presence or absence of a built-in graphics core;

    energy consumption.

A simplified block diagram of a modern multi-core processor is presented in Figure 1.

Let's start our review of the processor design with its main part - the core.

2. Processor core.

The processor core is its main part, containing all functional blocks and performing all logical and arithmetic operations.

Figure 1 shows a block diagram of the processor core. As can be seen in the figure, each processor core consists of several functional blocks:

    instruction fetch block;

    instruction decoding blocks;

    data sampling blocks;

    control unit;

    instruction execution blocks;

    blocks for saving results;

    interrupt handling block;

    set of registers;

    program counter.

The instruction fetch block reads instructions at the address specified in the program counter. Typically, it reads several instructions per clock cycle. The number of instructions read is determined by the number of decoding blocks, since the decoders must be loaded as fully as possible on each cycle of operation. So that the instruction fetch unit can work optimally, the processor core contains a branch predictor.

The branch predictor attempts to determine which sequence of commands will be executed after a branch is taken. This is necessary in order to keep the processor core pipeline as full as possible after a conditional jump.

Decoding blocks, as the name implies, are blocks that deal with decoding instructions, i.e. determine what the processor needs to do and what additional data is needed to execute the instruction. This task is very difficult for most modern commercial processors built on the CISC concept. The fact is that the length of instructions and the number of operands are not fixed, and this greatly complicates the life of processor developers and makes the decoding process a non-trivial task.

Often, individual complex instructions must be replaced with microcode - a series of simple instructions that collectively perform the same action as one complex instruction. The microcode set is flashed into ROM built into the processor. In addition, microcode simplifies processor development, since there is no need to create complex kernel blocks to execute individual commands, and fixing microcode is much easier than fixing an error in the functioning of the block.
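The microcode lookup described above can be sketched as a simple table. The instruction names and their micro-op sequences below are invented for illustration and do not correspond to any real instruction set:

```python
# Hypothetical illustration: a complex CISC-style instruction is expanded
# into a series of simpler micro-ops stored in an on-chip ROM.
# All instruction and micro-op names here are invented for this sketch.
MICROCODE_ROM = {
    # "PUSH" behaves like two simpler steps: adjust SP, then store.
    "PUSH": ["SUB SP, SP, 4", "STORE reg, [SP]"],
    # "CALL" pushes the return address, then jumps.
    "CALL": ["SUB SP, SP, 4", "STORE PC_NEXT, [SP]", "JMP addr"],
}

def decode(instruction):
    """Return the micro-op sequence for an instruction.
    Simple instructions pass through unchanged; complex ones
    are looked up in the microcode ROM."""
    opcode = instruction.split()[0]
    return MICROCODE_ROM.get(opcode, [instruction])

print(decode("ADD R1, R2"))  # simple: executed directly
print(decode("CALL addr"))   # complex: expanded from the ROM
```

Fixing a microcode bug then amounts to updating the table, which is exactly why it is much easier than fixing a hardware block.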

Modern processors usually have 2-4 instruction decoding blocks; for example, in Intel Core 2 processors, each core contains two such blocks.

Data sampling blocks fetch the data from cache memory or RAM that is needed to execute the current instructions. Typically, each processor core contains several data sampling blocks. For example, Intel Core processors use two data sampling blocks per core.

The control block, based on the decoded instructions, controls the operation of the instruction execution blocks, distributes the load between them, and ensures timely and correct execution of instructions. It is one of the most important blocks of the processor core.

Instruction execution blocks include several different types of blocks:

ALU – arithmetic logic unit;

FPU – device for performing floating point operations;

Blocks for processing instruction set extensions. Additional instructions are used to speed up the processing of data streams, encryption and decryption, video encoding, and so on. For this, additional registers and logic are introduced into the processor core. At the moment, the most popular instruction set extensions are:

MMX (Multimedia Extensions) is a set of instructions developed by Intel to speed up the encoding and decoding of streaming audio and video data;

SSE (Streaming SIMD Extensions) is a set of instructions developed by Intel to perform the same sequence of operations on a set of data while parallelizing the computing process. Command sets are constantly being improved, and at the moment there are revisions: SSE, SSE2, SSE3, SSSE3, SSE4;

ATA (Application Targeted Accelerator) is a set of instructions developed by Intel to speed up the operation of specialized software and reduce power consumption when working with such programs. These instructions can be used, for example, when calculating checksums or searching data;

3DNow is an instruction set developed by AMD to expand the capabilities of the MMX instruction set;

AES (Advanced Encryption Standard) is a set of instructions developed by Intel to speed up applications that encrypt data using the AES algorithm of the same name.
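The core idea behind SIMD extensions such as MMX and SSE can be illustrated without any real vector hardware: one "packed" instruction applies the same operation to several data elements at once, so far fewer instructions need to be issued. The sketch below only models instruction counts; the `width=4` packing is an assumption for illustration:

```python
# Conceptual sketch of the SIMD idea: same result, fewer issued "instructions".
def scalar_add(a, b):
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)  # one scalar add instruction per element
        issued += 1
    return out, issued

def simd_add(a, b, width=4):
    out, issued = [], 0
    for i in range(0, len(a), width):
        # one packed instruction handles `width` elements at once
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
        issued += 1
    return out, issued

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
print(scalar_add(a, b)[1], simd_add(a, b)[1])  # 8 instructions vs 2
```

Real SSE registers are 128 bits wide, which is where a width of four 32-bit elements comes from.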

The results saving block ensures that the result of instruction execution is written to RAM at the address specified in the instruction being processed.

Interrupt block. Working with interrupts is one of the most important tasks of the processor, allowing it to respond to events in a timely manner, interrupt the progress of a program and perform the actions required of it. Thanks to interrupts, the processor is capable of pseudo-parallel operation, i.e. so-called multitasking.

Interrupts are handled as follows. The processor checks for an interrupt request before starting each cycle. If there is an interrupt to handle, the processor stores on the stack the address of the instruction it was supposed to execute and the data received since the last instruction was executed, and proceeds to execute the interrupt service function.

After the interrupt processing function finishes executing, the saved data is read back from the stack, and the processor resumes executing the interrupted task.
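The save-and-restore sequence just described can be sketched with the stack modeled as a Python list. The handler address and register names are invented for illustration:

```python
# Sketch of the interrupt sequence: save context on the stack, run the
# handler, then restore the context exactly as it was.
stack = []
program_counter = 100
registers = {"A": 7, "B": 3}
INTERRUPT_HANDLER_ADDR = 5000  # hypothetical handler address

def enter_interrupt():
    global program_counter
    # Save the interrupted context on the stack...
    stack.append((program_counter, dict(registers)))
    # ...then jump to the interrupt service routine.
    program_counter = INTERRUPT_HANDLER_ADDR

def leave_interrupt():
    global program_counter
    # Restore the saved context and resume the interrupted task.
    saved_pc, saved_regs = stack.pop()
    registers.clear()
    registers.update(saved_regs)
    program_counter = saved_pc

enter_interrupt()
registers["A"] = 0  # the handler may freely clobber the registers
leave_interrupt()
print(program_counter, registers["A"])  # 100 7 — context fully restored
```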

Registers are a small amount (several hundred bytes) of ultra-fast memory built into the processor (access to registers is several times faster than access to cache memory), used for temporary storage of intermediate results of instruction execution. Processor registers are divided into two types: general-purpose registers and special registers.

General purpose registers are used when performing arithmetic and logical operations, or specific operations of additional instruction sets (MMX, SSE, etc.).

Special purpose registers contain system data necessary for the processor to operate. Such registers include, for example, control registers, system address registers, debugging registers, etc. Access to these registers is strictly regulated.

The program counter is a register containing the address of the instruction that the processor will begin executing on the next clock cycle.

Hello, dear readers. Today we will show you what a processor consists of on the inside. Many users have, of course, installed a processor on a motherboard at some point, but few know what it looks like inside. We will try to explain it in fairly simple language, understandably but without omitting details.

Many users believe that the processor looks exactly as shown in the picture.

However, this is the entire assembly, which consists of smaller and vital parts. Let's take a look at what the processor consists of from the inside. The processor includes:

The figure above, number 1, shows a protective cover that provides mechanical protection against the ingress of dust and other small particles. The cover is made of a material that has a high thermal conductivity coefficient, which allows you to remove excess heat from the crystal, thereby ensuring the normal temperature range of the processor.

Number 2 shows the "brain" of the processor and of the computer as a whole: the crystal. It is considered the "smartest" element of the processor, performing all the tasks assigned to it. You can see that a thin microcircuit layer is applied to the crystal, which provides the specified functioning of the processor. Most often, processor crystals are made of silicon: as a semiconductor, its conductivity can be precisely controlled, which makes it possible to form the transistors that carry out all the processing.

Number 3 shows the textolite substrate (platform) to which everything else is attached: the crystal and the lid. The platform also acts as a good conductor, providing electrical contact with the crystal. On the back side of the platform, to improve electrical conductivity, there are many contact points plated with precious metal (sometimes even gold).

Here's what electrically conductive points look like using an Intel processor as an example.

The shape of the contacts depends on which socket is on the motherboard. It also happens that instead of the points on the back of the platform you can see pins that perform the same role. Typically, for the Intel processor family, the pins are located on the motherboard itself; in this case, contact points are located on the substrate (also called the platform). For the AMD processor family, the pins are located directly on the substrate itself. Such processors look like this:

Now let's look at the method of attaching all the parts. In order for the lid to be firmly held on the substrate, it is “seated” using a special glue-sealant that is resistant to high temperatures. This allows the structure to be in permanent connection without violating its integrity.

To ensure that the crystal does not overheat, a special gasket 1 is applied to it, on top of which, in turn, thermal paste 2 is applied, which ensures effective heat removal to the lid. The lid is also “lubricated” on the inside with thermal paste.

Let's now see what a dual-core processor looks like. The core is a separate, functionally independent crystal, which is installed in parallel on the substrate. It looks like this.

Thus, 2 cores installed side by side increase the total processor power. However, if you see 2 crystals standing next to each other, this will not always mean that you have a dual-core processor. Some sockets have 2 crystals installed, one of which is responsible for the arithmetic-logical part, and the other for graphics processing (a certain built-in graphics processor). This comes in handy in cases where you have a built-in video card that is not powerful enough to handle, for example, some game. In these cases, the lion's share of calculations is taken over by the graphics part of the central processor. This is what a processor with a graphics core looks like.

So, friends, we figured out what the processor consists of. It is now clear that every device inside the processor plays an important and indispensable role in its operation.

The computer CPU (Central Processing Unit) carries out the basic actions needed to execute commands. It consists of the following elements:

  1. command decoder;
  2. arithmetic logic unit (ALU), which performs operations on the operands;
  3. registers for storing data, addresses and service information;
  4. device for generating (calculating) addresses of operands;
  5. control unit.

- Control unit (CU) - directs the processor through the sequential fetching, decoding and execution of program commands stored in memory. The control unit generates the timing diagram for the operation of all processor nodes. Some registers can also be classified as part of the control unit.

- Device for generating operand addresses - calculates the address at which the next access to the memory section containing the operand will occur.

- ALU - a combinational logic device with two (multi-bit) inputs, to which two similar operand words are supplied; the ALU output produces the result of the operation the processor performs on the operands, such as addition, multiplication, etc. The minimum set of operations (for a von Neumann machine) that an ALU must perform consists of addition, inversion and logical "AND"; all other operations can be derived from these.
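The claim that all other operations can be derived from addition, inversion and AND can be checked directly. The sketch below assumes a hypothetical 8-bit ALU and builds OR (via De Morgan's law) and subtraction (via two's complement) from just those three primitives:

```python
# Only these three primitives are assumed to exist in the ALU (8-bit width).
MASK = 0xFF

def alu_add(a, b): return (a + b) & MASK   # addition
def alu_not(a):    return ~a & MASK        # inversion
def alu_and(a, b): return a & b            # logical AND

def alu_or(a, b):
    # De Morgan: a OR b == NOT(NOT a AND NOT b)
    return alu_not(alu_and(alu_not(a), alu_not(b)))

def alu_sub(a, b):
    # Two's complement: a - b == a + NOT(b) + 1
    return alu_add(a, alu_add(alu_not(b), 1))

print(alu_or(0b1010, 0b0101))  # 15
print(alu_sub(9, 4))           # 5
```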

- Registers. The minimum set of registers required for the processor to function includes the following:

Accumulator - stores the result of operations; often of double length compared to the processor word size (to hold the results of multiplication and shift operations).

Program counter - contains the address of the next command.

Address register - contains the address of the operand, used for indirect addressing.

Flag register (status and control) - contains a code characterizing the results of previous operations, as well as information about the current state of the central processing unit.

Figure: MC68HC05 processor registers

- Register file - a set of registers of the same type.

Each processor has its own set of registers. Two groups of processors can be distinguished: those with general-purpose registers and those with a specialized set of registers (for example, Intel x86). In the first case, all registers in the register file are the same and can be used arbitrarily in instructions. In the second, each register is assigned its own function, and the use of registers in commands is specified in the format of each command. Processors with register files, however, require large hardware costs to organize communication between registers.

Figure: Registers of MC68xxx and Intel x86 processors

The processor model, from a programmer's point of view, is the set of registers, command formats, addressing methods, memory organization, etc. One can also consider the register model of the processor: the set of registers, their formats and the ways of working with them:

At the user level (general purpose and flag registers)

At the system level (processor control and memory-organization registers, elements of interrupt handling and direct memory access (DMA)).

Computational core (Core) - this term refers to the set of processor elements necessary to execute a command.

Peripherals - devices external to the processor-memory connection.

I/O devices - part of peripheral devices designed to connect the computer with the “outside world”.

The design and principle of operation of the processor

2.1 The principle of operation of the processor core.

The operating principle of the processor core is based on a cycle described by John von Neumann in 1946. In a simplified form, the stages of the processor core cycle can be represented as follows:

1. The instruction fetch unit checks for interrupts. If there is an interrupt, the register data and the program counter are written to the stack, and the address of the interrupt handler is written to the program counter. At the end of the interrupt handling function, the data will be restored from the stack;

2. The instruction fetch unit reads the address of the instruction to be executed from the program counter. The command is read from cache memory or RAM at this address. The received data is transmitted to the decoding unit;

3. The command decoding unit decrypts the command, if necessary using the microcode recorded in ROM to interpret it. If it is a jump command, the jump address is written to the program counter and control is transferred to the instruction fetch block (step 1); otherwise the program counter is increased by the instruction size (for a processor with 32-bit instructions, by 4) and control is transferred to the data sampling block;

4. The data sampling unit reads the data required to execute the command from the cache memory or RAM and transfers control to the scheduler;

5. The control block determines which instruction execution block to process the current task and transfers control to this block;

6. Instruction execution blocks perform the actions required by the command and transfer control to the results saving block;

7. If it is necessary to save the results in RAM, the results storage block performs the required actions and transfers control to the instruction fetch block (point 1).
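The steps above can be compressed into a toy fetch-decode-execute loop. The three-field instruction format and the opcode names below are invented for this sketch:

```python
# Toy simulator of the cycle described above. Each loop iteration walks
# through fetch, decode, data fetch, execution and result saving.
memory = {"x": 2, "y": 5, "result": 0}
program = [
    ("LOAD", "x", "acc"),        # data fetch: read x into the accumulator
    ("ADD", "y", "acc"),         # execution: acc += y
    ("STORE", "acc", "result"),  # result saving: write acc back to memory
    ("HALT", None, None),
]

program_counter = 0
acc = 0
while True:
    opcode, src, dst = program[program_counter]  # instruction fetch + decode
    program_counter += 1                         # advance the program counter
    if opcode == "HALT":
        break
    if opcode == "LOAD":
        acc = memory[src]                        # data fetch
    elif opcode == "ADD":
        acc += memory[src]                       # execution
    elif opcode == "STORE":
        memory[dst] = acc                        # result saving
print(memory["result"])  # 7
```

A jump instruction would simply overwrite `program_counter` instead of letting it advance, exactly as step 3 describes.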

The loop described above is called a process (which is where the word "processor" comes from). The sequence of commands executed is called a program.

The speed of transition from one stage of the cycle to another is determined by the clock frequency of the processor, and the operating time of each stage of the cycle and the time spent on the complete execution of one instruction are determined by the design of the processor core.

2.2. Ways to improve processor core performance.

Increasing the performance of the processor core by raising the clock frequency has a strict limitation. Increasing the clock frequency entails an increase in processor temperature, power consumption and a decrease in the stability of its operation and service life.

Therefore, processor developers are using various architectural solutions to increase processor performance without increasing the clock speed.

Let's look at the main ways to increase processor performance.

2.2.1. Pipelining.

Each instruction executed by the processor sequentially passes through all kernel blocks, each of which carries out its own part of the actions necessary to execute the instruction. If you start processing a new instruction only after completing work on the first instruction, then most of the processor core blocks will be idle at any given time, and, consequently, the processor’s capabilities will not be fully used.

Let's consider an example in which the processor will execute a program consisting of five instructions (K1–K5), without using the principle of pipelining. To simplify the example, we assume that each processor core block executes an instruction in 1 clock cycle.

Authors of the article: Gvindzhiliya Grigory and Pashchenko Sergey
Cycle  Instruction fetch  Instruction decoding  Data sampling  Instruction execution  Result saving
1 K1 - - - -
2 - K1 - - -
3 - - K1 - -
4 - - - K1 -
5 - - - - K1
6 K2 - - - -
7 - K2 - - -
8 - - K2 - -
9 - - - K2 -
10 - - - - K2
11 K3 - - - -
12 - K3 - - -
13 - - K3 - -
14 - - - K3 -
15 - - - - K3
16 K4 - - - -
17 - K4 - - -
18 - - K4 - -
19 - - - K4 -
20 - - - - K4
21 K5 - - - -
22 - K5 - - -
23 - - K5 - -
24 - - - K5 -
25 - - - - K5

As can be seen from the table, the processor needed 25 clock cycles to execute five instructions. At the same time, in each cycle four of the five blocks of the processor core were idle, i.e. the processor used only 20% of its potential. Naturally, in real processors everything is more complicated: different processor blocks solve problems of different complexity, and the instructions themselves also vary in complexity. But in general the situation remains the same.

To solve this problem, in all modern processors, the execution of instructions is built on the principle of a pipeline, that is, as kernel blocks are released, they are loaded by processing the next instruction, without waiting until the previous instruction is fully executed.

Let's look at an example of executing the same program, consisting of five instructions, but using the principle of pipelining.

Cycle  Instruction fetch  Instruction decoding  Data sampling  Instruction execution  Result saving
1 K1 - - - -
2 K2 K1 - - -
3 K3 K2 K1 - -
4 K4 K3 K2 K1 -
5 K5 K4 K3 K2 K1
6 - K5 K4 K3 K2
7 - - K5 K4 K3
8 - - - K5 K4
9 - - - - K5

The same program was executed in 9 clock cycles, almost 2.8 times faster than without a pipeline. As can be seen from the table, the maximum processor load was reached at cycle 5, when all blocks of the processor core were in use; from the first to the fourth cycle, inclusive, the pipeline was filling.
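The cycle counts in the two tables follow from two simple formulas, which can be checked directly (assuming, as in the tables, one clock cycle per stage):

```python
# Cycles needed to run `n` instructions through `stages` pipeline stages.
def cycles_without_pipeline(n, stages):
    # Each instruction passes through all stages before the next one starts.
    return n * stages

def cycles_with_pipeline(n, stages):
    # The pipeline fills once (`stages` cycles), then retires
    # one instruction per cycle for the remaining n - 1 instructions.
    return stages + (n - 1)

n, stages = 5, 5
print(cycles_without_pipeline(n, stages))  # 25, as in the first table
print(cycles_with_pipeline(n, stages))     # 9, as in the second table
print(round(cycles_without_pipeline(n, stages) / cycles_with_pipeline(n, stages), 1))
```

For a long instruction stream the speed-up approaches the number of stages, which is why deeper pipelines look attractive, until branches interfere, as discussed below.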

Since the processor executes instructions continuously, ideally it could be 100% busy, and the longer the pipeline, the greater the performance gain. But in practice this is not the case.

Firstly, the real instruction stream processed by the processor is not linear: it often contains branches. Until a conditional jump command is fully processed, the pipeline cannot start executing the next command, since it does not know at what address that command is located.

After a conditional branch, the pipeline has to be refilled, and the longer the pipeline, the longer that takes. As a result, the performance gain from introducing the pipeline is reduced.

To reduce the influence of conditional branches on the operation of the pipeline, conditional branch prediction blocks are introduced into the processor core. The main task of these blocks is to determine when the conditional branch will be made and what commands will be executed after the conditional branch is made.

If the conditional jump can be predicted, then the execution of instructions at the new address begins before the processing of the conditional jump instruction is completed. As a result, the filling of the conveyor will not be affected.

According to statistics, the accuracy of conditional branch prediction blocks in modern processors exceeds 90%, which makes it possible to create fairly long, but at the same time well-filled pipelines.
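One classic way to build such a prediction block (shown here as a sketch, not as the scheme any particular processor uses) is the two-bit saturating counter: two consecutive mispredictions are needed to flip the prediction, which suits loop branches well:

```python
# Two-bit saturating-counter branch predictor.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict "not taken"; 2,3 = predict "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move toward "taken" or "not taken", saturating at 0 and 3,
        # so a single atypical branch does not flip the prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, one exit, then taken 8 more times.
history = [True] * 8 + [False] + [True] * 8
correct = 0
for taken in history:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "/", len(history))  # 14 / 17
```

Note that the single loop exit costs only one misprediction: the counter stays saturated, so the following iterations are still predicted correctly.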

Secondly, the processed instructions are often interdependent, that is, one instruction requires the result of executing another instruction as its input data.

In this case, the dependent instruction can be executed only after the first one has been completely processed. However, modern processors can analyze the code several instructions ahead and, for example, process a third instruction, which does not depend on the first two, in parallel with the first.

In most modern processors, the task of analyzing the relationship of instructions and drawing up the order of their processing falls on the shoulders of the processor, which inevitably leads to a decrease in its performance and an increase in cost.
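The dependency analysis itself reduces to comparing the register sets each instruction reads and writes. The three-operand text format below is invented for illustration; real schedulers work on decoded micro-ops:

```python
# Two instructions can run in parallel when neither reads or writes
# a register that the other writes.
def regs(instr):
    """Split 'ADD R3, R1, R2' into (written, read) register sets."""
    op, *operands = instr.replace(",", "").split()
    return {operands[0]}, set(operands[1:])

def independent(i1, i2):
    w1, r1 = regs(i1)
    w2, r2 = regs(i2)
    # No read-after-write, write-after-read or write-after-write hazard:
    return not (w1 & (r2 | w2) or w2 & r1)

print(independent("ADD R3, R1, R2", "MUL R4, R1, R2"))  # True: only shared reads
print(independent("ADD R3, R1, R2", "MUL R4, R3, R2"))  # False: R3 written, then read
```

A VLIW compiler performs exactly this kind of check at build time, packing provably independent instructions into one long word.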

However, static scheduling is becoming increasingly popular, when the order in which the program is executed by the processor is determined at the compilation stage of the program. In this case, instructions that can be executed in parallel are combined by the compiler into one long instruction in which all instructions are known to be parallel. Processors that work with such instructions are built on the VLIW (Very long instruction word) architecture.

2.2.2. Superscalarity.

Superscalarity is a computing-core architecture in which the most heavily loaded blocks are present in several copies. For example, in a processor core, one instruction fetch unit can feed several decoding units at once.

In this case, blocks that perform more complex actions and work longer, due to parallel processing of several instructions at once, will not delay the entire pipeline.

However, parallel execution of instructions is possible only if these instructions are independent.

A block diagram of a pipeline core of a hypothetical processor built using the superscalar principle is shown in Figure 1. In this figure, each processor core has several decoding units, several data fetch units, and several instruction execution units.

2.2.3. Parallel data processing.

It is impossible to increase processor performance endlessly by raising the clock frequency. A higher clock frequency means more heat generation and lower service life and reliability, while memory-access delays eat up much of the gain. Indeed, you will rarely find a processor today with a clock frequency above 3.8 GHz.

Problems associated with increasing clock speeds force developers to look for other ways to increase processor performance. One of the most popular methods is parallel computing.
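The essence of parallel computing is easy to show in a sketch: split one big job into independent chunks and combine the partial results. All names and sizes below are invented. One caveat, stated plainly: CPython threads share a global interpreter lock, so a real CPU-bound speedup needs separate processes or cores; this sketch only illustrates the decomposition itself.

```python
# Minimal illustration of parallel decomposition: independent chunks
# that separate cores could process, combined at the end.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))

def partial_sum(chunk):
    return sum(chunk)

n_workers = 4
size = len(data) // n_workers
chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total == sum(data))  # the parallel result matches the serial one
```

The important point for the discussion that follows: this only works because the chunks are independent. When chunks depend on each other's results, the work cannot be divided so cleanly, which is exactly why many programs gain little from extra cores.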

The vast majority of modern processors have two or more cores. Top models can contain 8 or even 12 cores, often with Hyper-Threading support on top. The advantage of additional cores is clear: we effectively get several processors, each able to solve its own tasks independently, and performance rises. However, the performance gain does not always live up to expectations.

Firstly, not all programs support distributing calculations across multiple cores. Naturally, programs can be divided between cores, so that each core runs its own set of independent programs. For example, one core runs an operating system with a set of utility programs, another core runs user programs, and so on.

But this gives a performance gain only until a program appears that needs more resources than one core can provide. It is good if that program supports load distribution between several cores; at the moment, however, publicly available programs capable of spreading work across 12 cores, let alone with Hyper-Threading on top, can be counted on the fingers of one hand. Of course, I am exaggerating: programs optimized for multi-threaded computing exist, but most ordinary users do not need them. The most popular programs, and games even more so, still adapt poorly to multi-core processors, especially when there are more than four cores.

Secondly, working with memory becomes more complicated, since there are many cores, and they all require access to RAM. A complex mechanism is required that determines the order of access of processor cores to memory and other computer resources.

Thirdly, energy consumption increases, and, consequently, heat dissipation increases and a powerful cooling system is required.

Well, fourthly, the production cost of multi-core processors is quite high, and, accordingly, the price of such processors is steep.

Despite all the shortcomings, processors with 2-4 cores undoubtedly provide a significant performance increase. Processors with more than four cores, however, do not always live up to expectations at the moment. In the near future the situation should change dramatically: many programs supporting multithreading will appear, the performance of individual cores will grow, and their price will fall.

2.2.4. Hyper-Threading technology.

Intel Hyper-Threading technology allows each processor core to execute two threads simultaneously, essentially turning one physical core into two virtual ones. This is possible because such a core stores the state of two threads at once: it has a separate register set, program counter, and interrupt unit for each thread. As a result, the operating system sees such a core as two separate cores and works with them just as it would with a dual-core processor.

However, the remaining core elements are common to both threads and shared between them. Moreover, when one thread frees up pipeline elements for some reason, the other thread uses the idle units.

Pipeline elements may sit idle when, for example, a cache access misses and data must be read from RAM, a branch was predicted incorrectly, the results of a previous instruction are being waited on, or certain units are simply not needed to process the current instruction.

Most programs cannot fully load the processor: some mostly perform simple integer calculations and barely touch the FPU, while others, such as 3D Studio, need a great deal of floating-point computation but leave other execution units idle.

In addition, almost all programs contain many conditional branches and dependent variables. As a result, Hyper-Threading technology can provide a significant performance boost by helping to load the core's pipeline as fully as possible.

But it is not that simple. Naturally, the gain will be smaller than from several physical cores, since the threads still share the units of one pipeline and often have to wait for a needed unit to be freed. Moreover, most processors already have several physical cores, and with Hyper-Threading enabled there may simply be too many virtual cores, especially when the processor has four or more physical ones.

Since at the moment there are very few programs capable of distributing calculations over a large number of cores, in this case the result may disappoint users.

There is another serious problem with Hyper-Threading: conflicts that arise when instructions from different threads need the same execution units. Two similar threads may run in parallel and constantly compete for the same units, in which case the performance gain will be minimal.

As a result, Hyper-Threading technology is very dependent on the type of load on the processor and can give a good performance boost, or it can be practically useless.

2.2.5. Turbo Boost technology.

The performance of most modern processors can be increased slightly at home by overclocking, that is, by forcing them to operate at frequencies above the nominal value declared by the manufacturer.

The processor frequency is the system bus frequency multiplied by a certain coefficient called the multiplier. For example, a Core i7-970 processor works with the DMI system bus at a base frequency of 133 MHz and has a multiplier of 24, so its core clock frequency is 133 MHz * 24 = 3192 MHz.
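The arithmetic from the Core i7-970 example is trivial, but worth spelling out, since the same formula reappears later when discussing EIST:

```python
# Core clock = base (bus) clock x multiplier, using the figures
# for the Core i7-970 given in the text.
base_clock_mhz = 133
multiplier = 24
core_clock_mhz = base_clock_mhz * multiplier
print(core_clock_mhz)  # 3192 MHz
```

Raising either factor raises the core clock, which is exactly what overclocking via BIOS settings does.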

If you increase the multiplier in the BIOS settings or raise the system bus clock frequency, the processor clock frequency will increase, and, accordingly, its performance will increase. However, this process is far from safe. Overclocking may cause the processor to become unstable or even fail. Therefore, overclocking must be approached responsibly and carefully monitor the processor operating parameters.

With the advent of Turbo Boost technology, everything became much simpler. Processors with this technology can dynamically raise their clock frequency for short periods, increasing performance. At the same time, the processor monitors all of its operating parameters: voltage, current, temperature, and so on, preventing instability and, above all, hardware failure. For example, the processor can shut down unused cores, lowering the overall temperature, and in return raise the clock frequency of the remaining ones.

Since at the moment there are not very many programs that use all processor cores for data processing, especially if there are more than four of them, the use of Turbo Boost technology can significantly increase processor performance, especially when working with single-threaded applications.

2.2.6. Efficiency of command execution.

Depending on the types of instructions processed and the method of their execution, processors are divided into several groups:

  • classic CISC processors;
  • RISC processors with a reduced instruction set;
  • MISC processors with a minimal instruction set;
  • VLIW processors with very long instruction words.

CISC (Complex Instruction Set Computer) processors have a complex instruction set. The CISC architecture is characterized by:

  • complex and multi-purpose instructions;
  • a large set of different instructions;
  • variable instruction length;
  • a variety of addressing modes.

Historically, processors with CISC architecture appeared first, and their appearance was due to the general trend in the development of the first computers. They tried to make computers more functional and at the same time easier to program. Naturally, it was initially more convenient for programmers to have a wide set of commands than to implement each function with a whole separate subroutine. As a result, the volume of programs was greatly reduced, and with it the complexity of programming.

However, this situation did not last long. Firstly, with the advent of high-level languages, the need for direct programming in machine code and assembly language disappeared, and, secondly, over time, the number of different commands increased greatly, and the instructions themselves became more complex. As a result, most programmers primarily used a specific set of instructions, virtually ignoring the more complex instructions.

As a result, programmers no longer gained much from a wide instruction set, since compilers now generated the machine code, while the processors themselves processed complex and varied instructions slowly, mainly because of difficulties decoding them.

In addition, processor developers debugged new complex instructions less, since it was a labor-intensive and complex process. As a result, some of them may contain errors.

And, naturally, the more complex the instructions, the more actions they perform, the more difficult it is to parallelize their execution, and, accordingly, the less efficiently they load the processor pipeline.

However, by this time a huge number of programs had already been developed for processors with CISC architecture, so it was economically unprofitable to switch to a fundamentally new architecture, even if it gave a gain in processor performance.

Therefore, a compromise was accepted, and CISC processors, starting with the Intel486DX, began to be produced using a RISC core. That is, immediately before execution, complex CISC instructions are converted into a simpler set of internal RISC instructions. To do this, they use sets of microinstructions written in ROM located inside the processor core - a series of simple instructions that together perform the same actions as one complex instruction.

RISC (Reduced Instruction Set Computer)– processors with a reduced instruction set.

The RISC processor concept favors short, simple, and standardized instructions. As a result, such instructions are easier to decode and execute, and, consequently, the processor design also becomes simpler, since complex blocks are not required to execute non-standard and multifunctional instructions. As a result, the processor becomes cheaper, and it becomes possible to further increase its clock frequency by simplifying the internal structure and reducing the number of transistors, or reduce power consumption.

Also, simple RISC instructions are much easier to parallelize than CISC instructions, and, therefore, it becomes possible to load the pipeline more, introduce additional instruction processing units, etc.

Processors built on the RISC architecture have the following main features:

  • fixed instruction length;
  • a small set of standardized instructions;
  • a large number of general purpose registers;
  • lack of microcode;
  • lower power consumption compared to CISC processors of similar performance;
  • simpler internal structure;
  • fewer transistors compared to CISC processors of similar performance;
  • absence of complex specialized blocks in the processor core.

As a result, although RISC processors require more instructions to complete the same task compared to CISC processors, they generally provide higher performance. First, executing a single RISC instruction takes much less time than executing a CISC instruction. Secondly, RISC processors make greater use of parallel processing capabilities. Third, RISC processors can have higher clock speeds than CISC processors.

However, despite their obvious advantages, RISC processors have not become as widespread as CISC. This is mainly not because they are in some respects worse than CISC processors; they are not. The problem is that CISC processors appeared first, and software written for CISC processors is incompatible with RISC processors.

As a result, it is extremely unprofitable to rewrite all the programs that have already been developed, debugged, and adopted by a huge number of users, which is why we are still forced to use CISC processors. True, as I already said, developers found a compromise: for a very long time now CISC processors have used a RISC core and replaced complex instructions with microprograms. This smoothed the situation somewhat, but pure RISC processors still outperform even CISC processors with a RISC core in most respects.

MISC (Minimal Instruction Set Computer) is a further development of the RISC architecture, based on even greater simplification of instructions and a reduction in their number: on average, MISC processors use 20-30 simple instructions. This approach made it possible to simplify the processor design further, reduce power consumption, and make maximum use of parallel data processing.

VLIW (Very Long Instruction Word) is a processor architecture that uses long instructions, each containing several operations combined by the compiler for parallel processing. In some implementations, instructions reach 128 or even 256 bits in length.

The VLIW architecture is a further enhancement of the RISC and MISC architectures with deep parallelism.

In RISC processors, the processor itself organizes parallel data processing, spending part of its resources on analyzing instructions, identifying dependencies, and predicting conditional branches. It can also make mistakes, for example in branch prediction, introducing serious delays into instruction processing, or it may look at too little of the program code at a time to find independent operations that could run in parallel. In VLIW processors, the task of optimizing parallel execution is handed to the compiler, which is limited neither in time nor in resources and can analyze the whole program to produce optimal code for the processor.

As a result, the VLIW processor benefited not only from the elimination of overhead costs for organizing parallel data processing, but also received a performance increase due to a more optimal organization of parallel execution of instructions.

In addition, the design of the processor was simplified, since some blocks responsible for analyzing dependencies and organizing parallelization of instruction processing were simplified or completely eliminated, and this, in turn, led to a reduction in power consumption and the cost of processors.

However, even the compiler finds it hard to analyze code and organize its parallelization. Program code is often highly interdependent, so the compiler has to pad instructions with empty (no-op) operations. Because of this, programs for VLIW processors can be noticeably longer than equivalent programs for traditional architectures.

The first VLIW processors appeared in the late 1980s and were developed by Cydrome. Other processors with this architecture include the TriMedia processors from Philips, the C6000 DSP family from Texas Instruments, and the Elbrus 2000, a Russian processor developed by MCST with the participation of MIPT students. Support for long instructions with explicit parallelism is also present in the Intel Itanium family.

2.3. Ways to reduce processor power consumption.

Power consumption is no less important a processor parameter than performance. The issue has become especially acute now, with the boom in the popularity of portable devices.

It is hard to imagine comfortable life without laptops, tablets, and smartphones, but the battery life of such devices casts a shadow over this trend. Laptops, on average, run for 3-5 hours on battery, tablets a little longer, and smartphones can last almost a day under full load, though not all of them. And all this is still too little for comfortable work.

The battery life of these devices directly depends on their power consumption, and a significant portion of the power consumption comes from the processor. Various methods and technologies are used to reduce processor power consumption. Let's look at the most popular of them.

The easiest way to reduce a processor's power consumption and heat dissipation is to lower its clock frequency and voltage, since a processor's power consumption is proportional to the square of its operating voltage and directly proportional to its clock frequency. Lowering the voltage has the strongest effect on power consumption, but as the voltage drops, the maximum stable clock frequency sooner or later has to be lowered as well, which naturally reduces performance.
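The scaling described above follows from the standard approximation for dynamic power of CMOS logic, P = C * V^2 * f, where C is the switched capacitance. The voltage and frequency values below are made up purely to show the scaling, not taken from any real processor.

```python
# Dynamic power of CMOS logic: P = C * V**2 * f (approximation).
def dynamic_power(c, v, f):
    return c * v ** 2 * f

p_nominal = dynamic_power(1.0, 1.20, 3.0e9)   # 1.20 V at 3.0 GHz
p_lowered = dynamic_power(1.0, 0.96, 2.4e9)   # both cut by 20%

# Cutting V and f by 20% each scales power by 0.8**3, i.e. to ~51%
print(round(p_lowered / p_nominal, 3))
```

This is why lowering voltage matters most: it enters the formula squared, so a modest voltage drop buys a disproportionate power saving, which is exactly what EIST and Cool'n'Quiet exploit below.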

However, power consumption is often a more critical operating parameter, and some performance degradation is acceptable. Thus, most mobile and embedded processors have clock speeds and operating voltages much lower than their desktop counterparts.

But manufacturers do not always set the optimal combination of voltage and clock frequency. Many mobile processors with a set clock speed could operate at lower voltages, which would significantly extend the battery life of a laptop computer.

To obtain the optimal ratio of performance to power consumption, it is necessary to select a voltage at which the processor will operate stably at a given clock frequency.

The clock frequency is determined based on the user's needs, then the minimum operating voltage is selected for it by gradually reducing the voltage and testing the processor under load.

There are also less drastic ways to solve this problem.

For example, EIST (Enhanced Intel SpeedStep Technology) allows the processor's power consumption to be changed dynamically by varying its clock frequency and voltage. The clock frequency changes through a decrease or increase in the multiplier.

I already mentioned the multiplier above, but I will repeat: the processor clock frequency is the system bus clock frequency multiplied by a certain factor called the multiplier. Decreasing or increasing this factor lowers or raises the processor clock frequency along with the operating voltage.

In cases where the processor is not fully loaded, its clock frequency can be lowered by reducing the multiplier. As soon as the user needs more computing resources, the multiplier is raised, up to its nominal value. This makes it possible to reduce energy consumption somewhat.

A similar power-saving technology, based on dynamically changing voltage and clock frequency according to processor load, is also used by AMD; it is called Cool'n'Quiet.

In the vast majority of cases, computers are either completely idle or are used only to a fraction of their capabilities. For example, watching a movie or typing text does not require the enormous computing capabilities that modern processors have. Moreover, these powers are not needed when the computer is idle, when the user has walked away or simply decided to take a short break. By reducing the processor clock frequency and its voltage at such moments, you can get a very serious increase in energy savings.

EIST can be configured using the BIOS and operating system software to set custom power management profiles to balance processor performance and power consumption.

Naturally, developers are trying to optimize the processor structure itself to reduce power consumption and allow the processor to operate at ultra-low voltages. However, this task is extremely complex and time-consuming. Prototype processors have already come very close to the threshold of the minimum operating voltage and are already having difficulty distinguishing the voltage of a logical one from a logical zero. However, despite this, processor developers, including engineers from Intel Corporation, promise to reduce the power consumption of modern processors by as much as 100 times over the next ten years. Well, let's wait and see what they come up with.

3. Cache memory.

Despite all the technologies and tricks of the developers, processor performance still directly depends on the speed of fetching commands and data from memory. And even if the processor has a balanced and well-thought-out pipeline, uses Hyper-Threading technology, and so on, but does not provide the proper speed for retrieving data and commands from memory, then, as a result, the overall performance of the computer will not meet your expectations.

Therefore, one of the most important parameters of the processor device is the cache memory, which is designed to reduce the time of fetching instructions and data from the main RAM and acts as an intermediate buffer with fast access between the processor and the main RAM.

Cache memory is built on expensive SRAM (static random-access memory), which provides much faster access to its cells than the DRAM (dynamic random-access memory) from which main RAM is built. In addition, SRAM does not require constant refreshing, which also increases its performance. We will look at the design of SRAM, DRAM, and other memory types in more detail in the next article; for now, let us examine the principle of operation and design of cache memory.

Cache memory is divided into several levels. Modern processors usually have three levels, and some top processor models sometimes have four levels of cache memory.

Higher level caches are always larger and slower than lower level caches.

The fastest and smallest is the Level 1 (L1) cache. It usually runs at the processor frequency, holds tens to a few hundred kilobytes, and sits right next to the instruction and data fetch units. It can be unified (Princeton architecture) or split into two parts (Harvard architecture): an instruction cache and a data cache. Most modern processors use a split L1 cache, since this allows data and instructions to be fetched simultaneously, which is extremely important for the pipeline.

Second-level cache memory is slower (access time, on average, 8-20 processor cycles), but has a capacity of several megabytes.

Level 3 cache memory is even slower, but has a relatively large capacity. There are processors with third-level cache memory larger than 24 MB.

In multi-core processors, the last level of cache memory is usually made common to all cores. Moreover, depending on the load on the cores, the amount of last-level cache memory allocated to the core can dynamically change. If a core has a high load, then more cache memory is allocated to it, by reducing the amount of cache memory for less loaded cores. Not all processors have this capability, only those that support Smart Cache technology (for example, Intel Smart Cache or AMD Balanced Smart Cache).

Lower level cache memory is usually individual for each processor core.

We have looked at how cache memory is structured; now let us figure out how it works.

The processor reads data from main RAM and stores it in all levels of the cache, evicting the data that has gone unused the longest and been accessed most rarely.

The next time the processor needs the same data, it is read not from main RAM but from the Level 1 cache, which is much faster. If the processor does not touch this data for a long time, it is gradually evicted from all cache levels: first from the first level, since it is the smallest, then from the second, and so on. But even if the data survives only in the third-level cache, accessing it is still faster than accessing main memory.
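The lookup path just described can be modeled in a few lines. Everything here is a toy: the capacities, the latencies, and the pure-LRU eviction policy are invented for illustration (real processors use set-associative caches and more nuanced replacement policies).

```python
# Toy model of a 3-level cache in front of RAM: try L1, then L2, then
# L3, fall back to RAM; on a hit, promote the data into faster levels.
from collections import OrderedDict

class CacheLevel:
    def __init__(self, capacity, latency):
        self.data = OrderedDict()  # insertion order doubles as LRU order
        self.capacity, self.latency = capacity, latency

    def get(self, addr):
        if addr in self.data:
            self.data.move_to_end(addr)  # mark as recently used
            return True
        return False

    def put(self, addr):
        self.data[addr] = True
        self.data.move_to_end(addr)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

levels = [CacheLevel(2, 1), CacheLevel(4, 10), CacheLevel(8, 40)]
RAM_LATENCY = 200

def access(addr):
    """Return the access latency in (made-up) cycles."""
    for i, level in enumerate(levels):
        if level.get(addr):
            for upper in levels[:i]:   # promote into the faster levels
                upper.put(addr)
            return level.latency
    for level in levels:               # miss everywhere: fill all levels
        level.put(addr)
    return RAM_LATENCY

print(access(0xA))  # first touch: full RAM latency
print(access(0xA))  # now an L1 hit: fast
```

Because L1 is the smallest, rarely used addresses fall out of it first and linger in the larger L2 and L3, mirroring the gradual eviction described above.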

However, the more cache levels there are, the more complex the algorithm for evicting stale data becomes, and the more time is spent keeping the data in all levels consistent. The gain from the cache's speed then quickly evaporates. In addition, SRAM is very expensive, and since each new cache level must be larger than the previous one, at large volumes the price-to-benefit ratio quickly deteriorates, which hurts the processor's competitiveness. In practice, therefore, no more than four levels of cache memory are used.

The situation becomes even more complicated in multi-core processors, where each core has its own cache. Data stored in the caches of different cores must additionally be kept synchronized. Suppose the same block of main RAM is loaded into the caches of the first and second cores, and the first core then modifies that block. The cache of the second core now holds stale data and must be updated, which puts extra load on the cache subsystem and reduces overall processor performance. The more cores, cache levels, and cache volume a processor has, the worse this problem becomes.
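The synchronization problem can be sketched with a highly simplified write-invalidate protocol: when one core writes a block, copies of that block in other cores' caches are invalidated, forcing those cores to re-read fresh data. Real coherence protocols (MESI and its relatives) track more states per cache line; this sketch, with invented classes and addresses, shows only the invalidation traffic itself.

```python
# Toy write-invalidate coherence between two cores sharing one memory.

class Core:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # block address -> cached value

class Bus:
    def __init__(self, cores, memory):
        self.cores, self.memory = cores, memory

    def read(self, core, addr):
        if addr not in core.cache:
            core.cache[addr] = self.memory[addr]  # miss: fetch from RAM
        return core.cache[addr]

    def write(self, core, addr, value):
        core.cache[addr] = value
        self.memory[addr] = value
        for other in self.cores:          # invalidate stale copies
            if other is not core:
                other.cache.pop(addr, None)

memory = {0x10: 1}
c0, c1 = Core("core0"), Core("core1")
bus = Bus([c0, c1], memory)

bus.read(c0, 0x10)          # both cores cache the block
bus.read(c1, 0x10)
bus.write(c0, 0x10, 42)     # core0 writes: core1's copy is invalidated
print(0x10 in c1.cache)     # core1 no longer holds the block
print(bus.read(c1, 0x10))   # re-read brings in the fresh value, 42
```

Every write here generates extra traffic for the other core, which is precisely the overhead that grows with the number of cores and cache levels.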

But, despite such difficulties in working with cache memory, its use provides a clear increase in operating speed without a significant increase in the cost of the computer. And until RAM is invented that can compete in speed with SRAM memory and in price with DRAM memory, a hierarchical organization of RAM will be used using several levels of cache memory.

Perhaps, this is where we will finish the review of the processor device, since a review of system buses and the principle of their operation was given in the article “Design and purpose of the motherboard”, and a description of the main RAM controller, often included in the processor, types of RAM and principles of its operation will be in the next article.


It is very difficult to surprise the modern consumer of electronics. We are already accustomed to the fact that our pocket is rightfully occupied by a smartphone, a laptop is in our bag, a smart watch is obediently counting steps on our hand, and headphones with an active noise reduction system are caressing our ears.

It is a funny thing, but we have grown used to carrying not one but two, three, or more computers with us at once. After all, that is exactly what you can call any device that has a CPU. And it does not matter at all what a particular device looks like: a miniature chip, which has come through a turbulent and rapid path of development, is responsible for its operation.

Why did we bring up the topic of processors? It's simple. Over the past ten years, there has been a real revolution in the world of mobile devices.

There is only a 10 year difference between these devices. But Nokia N95 seemed like a space device to us back then, and today we look at ARKit with a certain distrust

But everything could have turned out differently and the battered Pentium IV would have remained the ultimate dream of the average buyer.

We have tried to avoid complex technical terms, to explain how a processor works, and to find out which architecture holds the future.

1. How it all started

The first processors were completely different from what you can see when you open the lid of your PC's system unit.

Instead of microcircuits, the 1940s used electromechanical relays supplemented with vacuum tubes. The tubes acted as diodes whose state could be controlled by lowering or raising the voltage in the circuit. Such structures looked like this:

One gigantic computer required hundreds, sometimes thousands, of such switching elements. Yet on such a computer you could not run even a simple editor like Notepad or TextEdit from the standard Windows or macOS set; the machine simply would not have enough power.

2. The emergence of transistors

The first field-effect transistors were described back in 1928, but the world changed only after the advent of the bipolar transistor, invented in 1947.

In the late 1940s, experimental physicist Walter Brattain and theorist John Bardeen developed the first point-contact transistor. In 1950 it was followed by the first junction transistor, and in 1954 the well-known manufacturer Texas Instruments announced a silicon transistor.

But the real revolution came in 1959, when Jean Hoerni developed the first silicon planar (flat) transistor, which became the basis for monolithic integrated circuits.

Yes, it's a little complicated, so let's dig a little deeper and understand the theoretical part.

3. How a transistor works

So, the task of an electrical component like the transistor is to control current. Simply put, this tricky little switch governs the flow of electricity.

The main advantage of a transistor over a conventional switch is that it does not require a human. That is, such an element can control the current on its own, and it does so much faster than you could switch an electrical circuit on or off yourself.

You probably remember from your school computer science course that a computer “understands” human language through combinations of just two states: “on” and “off”. In the understanding of the machine, this is the state “0” or “1”.

The computer's job is to represent electric current as numbers.

And if previously the task of switching states was performed by clumsy, bulky and ineffective electrical relays, now the transistor has taken on this routine work.

Since the early 60s, transistors began to be made from silicon, which made it possible not only to make processors more compact, but also to significantly increase their reliability.

But first, let's deal with the diode

Silicon (Si, "silicium" in the periodic table) belongs to the semiconductors: on the one hand, it conducts current better than a dielectric; on the other, worse than a metal.

Whether we like it or not, to understand the work and further history of the development of processors we will have to plunge into the structure of one silicon atom. Don't be afraid, we'll make it short and very clear.

The task of the transistor is to amplify a weak signal using an additional power source.

The silicon atom has four valence electrons, through which it forms covalent bonds with four neighboring atoms, creating a crystal lattice. While most electrons remain in bonds, a small fraction can move through the lattice. It is because of this partial mobility of electrons that silicon is classified as a semiconductor.

But such weak electron movement would not allow the transistor to be used in practice, so scientists decided to boost transistor performance through doping, that is, by adding atoms of elements with a characteristic electron arrangement to the silicon crystal lattice.

So they began to use a pentavalent phosphorus impurity, which yields n-type silicon. The extra, loosely bound electron it contributes increases the current flow.

For p-type doping, the catalyst became boron, which has only three valence electrons. The missing electron leaves holes in the crystal lattice (which behave like positive charges), and because neighboring electrons can hop into these holes, the conductivity of the silicon again rises significantly.

Let's say we took a silicon wafer and doped one part of it with a p-type impurity and the other part with an n-type impurity. The result is a diode, the basic building block of the transistor.

Now the electrons in the n-part tend to drift into the holes in the p-part. As they cross over, the n-side near the junction is left with a slight positive charge, and the p-side acquires a slight negative one. The electric field that forms as a result of this migration creates a barrier that prevents further movement of electrons.

If you connect a power source to the diode so that "–" touches the p-side of the wafer and "+" touches the n-side, no current can flow: the holes are attracted to the negative contact of the source, the electrons to the positive one, and the depletion layer at the junction widens, breaking the link between the carriers on the p- and n-sides.

But if you connect a source of sufficient voltage the other way around, that is, "+" to the p-side and "–" to the n-side, the electrons on the n-side are repelled by the negative pole and pushed toward the p-side, where they occupy holes in the p-region.

These electrons are then drawn onward by the positive pole of the supply and keep moving from hole to hole through the p-region: current flows. This phenomenon is called forward bias of the diode.
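The two bias modes can be summed up in a toy model (a deliberately idealized sketch in Python; the 0.7 V figure is a typical barrier voltage for a silicon junction, and real diode behavior is far more gradual):

```python
# Idealized silicon diode: conducts only when forward biased hard enough
# to overcome the potential barrier at the p-n junction.
BARRIER_VOLTAGE = 0.7  # volts, typical for a silicon junction

def diode_conducts(applied_voltage: float) -> bool:
    """Positive voltage means "+" on the p-side (forward bias);
    negative voltage means the reverse connection."""
    return applied_voltage > BARRIER_VOLTAGE

print(diode_conducts(1.5))   # forward bias above the barrier: True
print(diode_conducts(-1.5))  # reverse bias, no current: False
print(diode_conducts(0.3))   # forward bias too weak to cross the barrier: False
```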

Diode + diode = transistor

The transistor itself can be thought of as two diodes joined back to back. The p-region (the one containing the holes) is shared between them and is called the "base".

An n-p-n transistor has two n-regions with extra electrons, the "emitter" and the "collector", and between them one thin region with holes: the p-region, called the "base".

If you connect a power supply (let's call it V1) across the n-regions of the transistor, then whichever way its poles face, one of the two diodes will be reverse biased and the transistor will remain closed.

But as soon as we connect another power source (let's call it V2), with its "+" contact on the central p-region (the base) and its "–" contact on an n-region (the emitter), some electrons flow through the newly formed circuit (V2), while the rest are pulled into the positively biased n-region. As a result, electrons flood into the collector region, and the weak base current is amplified.
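This switching-and-amplifying behavior can be caricatured in a few lines (a hypothetical model: the 0.7 V barrier and the gain of 100 are typical small-signal values assumed for illustration, not figures from the text):

```python
CURRENT_GAIN = 100       # typical current gain ("beta") of a small transistor
BARRIER_VOLTAGE = 0.7    # base-emitter junction barrier, volts

def collector_current(base_voltage: float, base_current: float) -> float:
    """Idealized NPN transistor: closed until the base-emitter junction
    is forward biased, then the weak base current is amplified."""
    if base_voltage <= BARRIER_VOLTAGE:
        return 0.0                      # transistor closed, no current
    return CURRENT_GAIN * base_current  # transistor open, current amplified

print(collector_current(0.2, 0.001))  # below the barrier: 0.0
print(collector_current(0.8, 0.001))  # 1 mA of base current amplified 100x
```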

Let's exhale!

4. So how does a computer work?

And now the most important.

Depending on the applied voltage, the transistor is either open or closed. If the voltage is insufficient to overcome the potential barrier (the same one at the junction of the p- and n-regions), the transistor stays closed: the "off" state or, in the language of the binary system, "0".

When there is enough voltage, the transistor opens, and we get "on", or "1" in binary.

This state, 0 or 1, is called a “bit” in the computer industry.

That is, we get the key property of the very switch that opened the way to computers for humanity!

The first electronic digital computer, ENIAC, or, more simply, the first computer, used about 18 thousand vacuum tubes. It was the size of a tennis court and weighed 30 tons.

To understand how a processor works, you need to understand two more key points.

Point 1. So, we have established what a bit is. But on its own it can express only two characteristics of something: "yes" or "no". So that the computer could understand us better, bits were grouped into combinations of eight (each a 0 or a 1), called a byte.

Using one byte, you can encode a number from zero to 255, that is, 256 different values. With these combinations of zeros and ones you can encode anything at all.
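The arithmetic is easy to verify: eight two-state switches give 2^8 combinations. A quick sketch in Python:

```python
# Eight bits -> 2**8 distinct combinations.
print(2 ** 8)  # 256, i.e. the values 0 through 255

# Any number in that range is just a pattern of eight switches:
number = 77
bits = format(number, "08b")  # zero-padded 8-bit binary string
print(bits)                   # '01001101'

# And the pattern converts straight back into the number:
print(int(bits, 2))           # 77
```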

Point 2. Having numbers and letters without any logic would give us nothing. That is why the concept of logical operators appeared.

By connecting just two transistors in a certain way, you can implement a basic logical operation: "and" (transistors in series) or "or" (transistors in parallel). The combination of the voltage on each transistor and the way they are wired yields different combinations of zeros and ones.
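At the switch level this is easy to model (a sketch, not a circuit simulator: each argument plays the role of one transistor's state, 1 for open and 0 for closed):

```python
# Two transistors in series: current flows only if BOTH are open -> AND.
def gate_and(a: int, b: int) -> int:
    return a & b

# Two transistors in parallel: current flows if EITHER is open -> OR.
def gate_or(a: int, b: int) -> int:
    return a | b

# Truth tables for both gates:
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", gate_and(a, b), gate_or(a, b))
```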

Through the efforts of programmers, the binary values of zeros and ones are converted into decimal so that we can understand what exactly the computer is "saying". And to enter commands, our usual actions, such as typing letters on a keyboard, are represented as a binary chain of commands.

Simply put, imagine a lookup table, say ASCII, in which each letter corresponds to a combination of 0s and 1s. You press a key on the keyboard, and at that moment the transistors in the processor, directed by the program, switch so that the letter written on that key appears on the screen.
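The lookup itself is exactly what Python's built-in `ord` and `chr` expose (these are real ASCII codes; the actual keyboard-to-screen path of course involves far more machinery):

```python
key = "A"
code = ord(key)              # the character's ASCII code
print(code)                  # 65
print(format(code, "08b"))   # '01000001' - the bit pattern the hardware sees
print(chr(code))             # 'A' - decoded back for display
```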

This is a rather primitive explanation of the principle of operation of the processor and computer, but it is understanding this that allows us to move on.

5. And the transistor race began

After British radio engineer Geoffrey Dummer proposed in 1952 placing the simplest electronic components in a monolithic semiconductor crystal, the computer industry leapt forward.

From the integrated circuits Dummer proposed, engineers quickly moved on to microchips built from transistors. Several such chips, in turn, made up the CPU.

Of course, such processors bore little resemblance to modern ones. Moreover, until 1964 all processors shared one problem: they required an individual approach, a separate programming language for each processor.

  • 1964 IBM System/360. A computer with universal code compatibility: the instruction set of one processor model could be used on another.
  • 70s. The first microprocessors appear, starting with Intel's single-chip Intel 4004: 10 µm process, 2,300 transistors, 740 kHz.
  • 1972–1974 Intel 8008 and Intel 4040. 3,500 transistors at 500 kHz for the Intel 8008 (1972) and 3,000 transistors at 740 kHz for the Intel 4040 (1974).
  • 1974 Intel 8080. 6 µm process and 6,000 transistors. Clock frequency about 2 MHz. It was this processor that was used in the Altair-8800 computer. Its Soviet clone, the KR580VM80A, was developed by the Kyiv Research Institute of Microdevices. 8-bit.
  • 1976 Intel 8085. 3 µm process and 6,500 transistors. Clock frequency up to 6 MHz. 8-bit.
  • 1976 Zilog Z80. 3 µm process and 8,500 transistors. Clock frequency up to 8 MHz. 8-bit.
  • 1978 Intel 8086. 3 µm process and 29,000 transistors. Clock frequency 5–10 MHz. The x86 instruction set, which is still used today. 16-bit.
  • 1979 Motorola 68000. 3.5 µm and roughly 68,000 transistors. This processor was later used in the Apple Lisa computer.
  • 1982 Intel 80186. 3 µm process and about 55,000 transistors. Clock frequency up to 25 MHz. 16-bit.
  • 1982 Intel 80286. 1.5 µm process and 134,000 transistors. Frequency up to 12.5 MHz. 16-bit.
  • 1985 Intel 80386. 1.5 µm process and 275,000 transistors. Frequency up to 33 MHz in the 386SX version. 32-bit.

It would seem that the list could be continued indefinitely, but then Intel engineers faced a serious problem.

6. Moore's Law or how chipmakers can move on

It is the end of the 80s. Back in 1965, Gordon Moore, later one of the founders of Intel, formulated what came to be called "Moore's Law". It states:

Every 24 months, the number of transistors placed on an integrated circuit chip doubles.

It is difficult to call this a law in the strict sense. It would be more accurate to call it an empirical observation: comparing the pace of technological development, Moore concluded that a trend of this kind could take shape.
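The observation is easy to turn into arithmetic. Starting from the Intel 4004's 2,300 transistors in 1971 and doubling every two years, a short sketch projects the count forward (nine doublings by 1989 land remarkably close to the i486's 1.2 million):

```python
def projected_transistors(start_count: int, start_year: int, year: int) -> int:
    """Moore's law as arithmetic: one doubling every two years."""
    doublings = (year - start_year) // 2
    return start_count * 2 ** doublings

# Intel 4004 (1971): 2,300 transistors. Projecting to 1989:
print(projected_transistors(2_300, 1971, 1989))  # 1177600, about 1.2 million
```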

But already during the development of the fourth generation of processors, the Intel i486, engineers ran into a performance ceiling: they could no longer fit more transistors into the same area. The technology of the time did not allow it.

As a solution, an option was found using a number of additional elements:

  • cache memory;
  • pipeline;
  • built-in coprocessor;
  • clock multiplier

Part of the computational load fell on the shoulders of these four components. The appearance of cache memory, on the one hand, complicated the processor's design; on the other, it made the processor much more powerful.

The Intel i486 processor already consisted of 1.2 million transistors, and its maximum operating frequency reached 50 MHz.

In 1995, AMD joined the race and released the fastest i486-compatible processor of the time, the 32-bit Am5x86. It was manufactured on a 350-nanometer process, and its transistor count reached 1.6 million. The clock frequency rose to 133 MHz.

But chipmakers decided against chasing a further increase in the number of transistors on a chip within the already unwieldy CISC (Complex Instruction Set Computing) architecture. Instead, American engineer David Patterson proposed optimizing the operation of processors by keeping only the most essential computational instructions.

So processor manufacturers switched to the RISC (Reduced Instruction Set Computing) platform. But this turned out to be not enough.

In 1991, the 64-bit MIPS R4000 appeared, operating at 100 MHz. Three years later came the R8000, and after another two years the R10000, with clock frequencies up to 195 MHz. At the same time, the market for SPARC processors was developing; an architectural peculiarity of early SPARC chips was the absence of multiplication and division instructions.

Instead of fighting over the number of transistors, chip manufacturers began to rethink their architectures. Dropping "unnecessary" commands, executing instructions in a single clock cycle, general-purpose registers and pipelining made it possible to raise the clock frequency and power of processors quickly without inflating the transistor count.

Here are just some of the architectures that appeared between 1980 and 1995:

  • SPARC;
  • ARM;
  • PowerPC;
  • Intel P5;
  • AMD K5;
  • Intel P6.

They were based on the RISC platform, in some cases combined with partial use of the CISC platform. But advancing technology again pushed chipmakers to keep scaling up their processors.

In August 1999, the AMD K7 Athlon entered the market, manufactured on a 250-nanometer process and comprising 22 million transistors. Later the bar was raised to 38 million transistors, then to 250 million.

Process technology improved and clock frequencies climbed. But, as physics tells us, everything has a limit.

7. The end of transistor competitions is near

In 2007, Gordon Moore made a very strong statement:

Moore's Law will soon cease to apply. It is impossible to keep adding transistors ad infinitum. The reason for this is the atomic nature of matter.

It is noticeable to the naked eye that the two leading chip manufacturers, AMD and Intel, have clearly slowed the pace of processor development over the past few years. Process precision has advanced to just a few nanometers, yet it is impossible to squeeze in still more transistors.

And while semiconductor manufacturers promise multilayer transistors, drawing a parallel with 3D NAND memory, the x86 architecture, having hit this wall after 30 years, has acquired a serious competitor.

8. What awaits “regular” processors

Moore's Law has not held since 2016. This was officially acknowledged by Intel, the largest processor manufacturer. Chipmakers are no longer able to double computing power every two years.

And now processor manufacturers are left with a handful of options, none of them particularly promising.

The first option is quantum computers. There have already been attempts to build a computer that uses quantum particles to represent information. Several such quantum devices exist in the world, but they can cope only with algorithms of low complexity.

Moreover, mass production of such devices in the coming decades is out of the question. Expensive, inefficient and... slow!

Yes, quantum computers consume much less energy than their modern counterparts, but they will be slower until developers and component manufacturers switch to the new technology.

The second option is processors with layers of transistors. Both Intel and AMD are seriously thinking about this technology. Instead of one layer of transistors, they plan to use several. It seems that in the coming years there may well be processors in which not only the number of cores and clock speed, but also the number of transistor layers will be important.

The approach is viable, and it may let the duopoly milk the consumer for another couple of decades, but in the end the technology will hit a ceiling yet again.

Today, recognizing the rapid development of the ARM architecture, Intel has quietly announced chips of the Ice Lake family. The processors will be manufactured on a 10-nanometer process and will serve as the basis for laptops, tablets and other mobile devices. But that will happen in 2019.

9. ARM is the future

So, the x86 architecture appeared in 1978 and belongs to the CISC type of platform. That is, it carries instructions for every conceivable occasion. Versatility is the main strength of x86.

But that same versatility played a cruel joke on these processors. x86 has several key disadvantages:

  • the complexity of the commands and their outright intricacy;
  • high energy consumption and heat generation.

High performance came at the cost of energy efficiency. Moreover, only two companies currently work on the x86 architecture, Intel and AMD, and together they are effectively a duopoly. Only they can produce x86 processors, which means only they control the development of the technology.

At the same time, several companies are developing ARM (originally Acorn RISC Machine). Back in 1985, its developers chose the RISC platform as the basis for the architecture's further development.

Unlike CISC, RISC involves developing a processor with the minimum required number of instructions, but maximum optimization. RISC processors are much smaller than CISC, more energy efficient and simpler.

Moreover, ARM was originally created solely as a competitor to x86. The developers set the task of building an architecture that is more efficient than x86.

Since the 40s, engineers have understood that one of the priority tasks remains to work on reducing the size of computers, and, first of all, the processors themselves. But it’s unlikely that almost 80 years ago anyone could have imagined that a full-fledged computer would be smaller than a matchbox.

The ARM architecture was once backed by Apple, which launched production of its Newton handhelds based on the ARM6 family of ARM processors.

Sales of desktop computers are plummeting, while the number of mobile devices sold annually already numbers in the billions. Often, in addition to performance, when choosing an electronic gadget, the user is interested in several more criteria:

  • mobility;
  • autonomy.

The x86 architecture is strong on raw performance, but give up active cooling and even a powerful processor will look pathetic next to the ARM architecture.

10. Why ARM is the undisputed leader

It’s unlikely that you will be surprised that your smartphone, be it a simple Android or Apple’s 2016 flagship, is tens of times more powerful than full-fledged computers from the late 90s.

But how much more powerful is the same iPhone?

Comparing two different architectures in itself is a very difficult thing. Measurements here can only be taken approximately, but you can understand the enormous advantage that smartphone processors built on ARM architecture provide.

A universal helper in this matter is the synthetic performance test Geekbench. The utility is available on desktop computers as well as on the Android and iOS platforms.

Mid-range and entry-level laptops clearly lag behind the iPhone 7 in performance. In the top segment things are a little more complicated, but in 2017 Apple released the iPhone X with the new A11 Bionic chip.

The ARM architecture in it is already familiar to you, but the Geekbench scores nearly doubled. Laptops from the "top echelon" began to feel the pressure.

But only one year has passed.

The development of ARM is progressing by leaps and bounds. While Intel and AMD year after year demonstrate a 5-10% increase in performance, over the same period smartphone manufacturers manage to increase the power of processors by two to two and a half times.

Skeptical users who go through the top lines of Geekbench would just like to be reminded: in mobile technologies, size is what matters most.

Place an all-in-one PC with a powerful 18-core processor on the table, which “tears the ARM architecture to shreds,” and then place the iPhone next to it. Do you feel the difference?

11. Instead of a conclusion

It is impossible to cover the 80-year history of computer development in one material. But after reading this article, you will be able to understand how the main element of any computer – the processor – works, and what to expect from the market in the coming years.

Of course, Intel and AMD will work to further increase the number of transistors on one chip and promote the idea of ​​multilayer elements.

But do you, as a consumer, need that kind of power?

It’s unlikely that you’re unhappy with the performance of the iPad Pro or the flagship iPhone X. I don’t think you’re unhappy with the performance of your multicooker in your kitchen or the picture quality on your 65-inch 4K TV. But all these devices use processors based on ARM architecture.

Microsoft has already officially announced that it is looking toward ARM with interest. The company included support for this architecture in Windows 8.1, and is now actively working in tandem with the leading ARM chipmaker Qualcomm.