Why the server is not just a “big macbook”. Part 1

Memory topology

Very often I see servers and computing infrastructure approached at an amateur, kitchen-table level, even by seemingly professional people with top salaries of half a million and above.
To them, a server is just a big MacBook, and storage is just a big disk.

So, let’s figure it out. And let’s start with a scary topic: memory topology.

I will explain in plain, down-to-earth terms, so a request to the super pros right away: don’t try to catch me out, this is not written for you.

There is a processor that executes instructions and a channel to the memory those instructions need. The channel is characterized by two metrics: latency (how many nanoseconds a response takes) and bandwidth (how many gigabytes per second it moves).

Latency. UMA/NUMA

UMA – Uniform Memory Access, an architecture with uniform access to memory. In a multiprocessor (multicore) system, the distance (latency) from any processor (core) to any memory block is the same.

NUMA – Non-Uniform Memory Access, an architecture with non-uniform memory access. In a multiprocessor (multicore) system, the distance (latency) from a processor (core) to a memory block depends on where that block is attached.

What is the difference for a programmer or system architect? With NUMA, you need to think through how the system (its modules) is deployed and configured so that it runs at full speed while avoiding, or at least minimizing, accesses to remote memory. If you do not understand how to work with it, you can easily lose up to 30% of performance.
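To see how "far" remote memory is on a particular Linux machine, you can read the NUMA distance table that the kernel exposes in sysfs. Below is a minimal sketch, assuming the standard layout /sys/devices/system/node/nodeN/distance; the helper names are made up for this illustration, and the values are relative distances reported by the firmware, not nanoseconds.

```python
from pathlib import Path


def parse_distance_row(text):
    """Parse one sysfs 'distance' line such as '10 21' into a list of ints."""
    return [int(value) for value in text.split()]


def read_numa_distances(root="/sys/devices/system/node"):
    """Return {node_id: [distance to node 0, node 1, ...]}.

    Returns an empty dict if the directory is missing (non-Linux or no NUMA info).
    """
    distances = {}
    base = Path(root)
    if not base.is_dir():
        return distances
    for node_dir in sorted(base.glob("node[0-9]*")):
        dist_file = node_dir / "distance"
        if dist_file.is_file():
            node_id = int(node_dir.name[len("node"):])
            distances[node_id] = parse_distance_row(dist_file.read_text())
    return distances


def remote_penalty(row, node_id):
    """Ratio of the farthest node's distance to the local distance."""
    return max(row) / row[node_id]


if __name__ == "__main__":
    table = read_numa_distances()
    if not table:
        print("No NUMA information found on this machine")
    for node_id, row in table.items():
        print(f"node{node_id}: {row}, worst remote/local = {remote_penalty(row, node_id):.2f}")
```

On a UMA machine there is a single node and the ratio is 1.0; on a two-socket server the remote value is noticeably larger than the local one.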

Let me remind AMD fans that recent generations of AMD server processors were NUMA even inside a single socket. In fact, it was a hack: two dies were packed into one package.

Processor topology (with more than 2 CPUs)

With 2 CPUs everything is clear: CPU1 <-> CPU2. But with 4 processors it becomes more interesting. There are two options: a fully connected system and a system with intermediate nodes.
A fully connected system is one where every processor has a direct link to every other processor, so the latency of accessing any other processor’s memory is the same.
A system with intermediate nodes has fewer links: only two per processor.

  • CPU1 <-> CPU2 <-> CPU4

  • CPU1 <-> CPU3 <-> CPU4

In this case, CPU3 gets double the latency when accessing the memory of CPU2, which costs even more performance.
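The hop counts for the two 4-CPU topologies can be checked with a small sketch. It models the links listed above as a graph and counts hops with a breadth-first search; the names RING, FULL and hops are made up for this illustration.

```python
from collections import deque

# Links from the text: CPU1-CPU2, CPU2-CPU4, CPU1-CPU3, CPU3-CPU4 (two links per CPU).
RING = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}

# Fully connected: every CPU has a direct link to every other CPU.
FULL = {cpu: [other for other in range(1, 5) if other != cpu] for cpu in range(1, 5)}


def hops(topology, src, dst):
    """Number of inter-CPU links on the shortest path from src to dst (BFS)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for neighbor in topology[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # dst is unreachable


print(hops(FULL, 3, 2))  # 1: direct link
print(hops(RING, 3, 2))  # 2: goes through CPU1 or CPU4
```

Each extra hop adds another trip over an inter-processor link, which is where the doubled latency comes from.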

Memory bandwidth

Memory bandwidth is determined by a whole bunch of parameters, including these two: the number of memory channels on the memory controller and the speed of the module.

Let’s start with speed. Everyone who has chosen (or assembled) a computer has seen something like DDR4-2933 (the numbers may vary). In simple terms, DDR3, DDR4 and DDR5 are, roughly speaking, the format of the memory slot, i.e. you cannot insert a DDR5 module into a server with DDR3 slots. And 2933 (sometimes written PC2933) is the actual speed of the module. Note that it is measured not in MHz, as is sometimes mistakenly written, but in MT/s.
DDR – Double Data Rate – is a method of transferring data at double speed: data is transferred not once per clock cycle but on both edges of the clock signal, rising and falling. Therefore the real clock frequency is half the transfer rate. That is, DDR4-3200 is 3200 MT/s (megatransfers per second) running at 1600 MHz.
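The arithmetic behind the marking can be sketched in a few lines. This assumes a standard 64-bit (8-byte) data bus per memory channel; the function names are made up for this illustration.

```python
BUS_WIDTH_BYTES = 8  # assumption: 64-bit data bus per memory channel


def ddr_clock_mhz(mt_per_s):
    """Real I/O clock: two transfers per clock cycle, so half the MT/s figure."""
    return mt_per_s / 2


def channel_bandwidth_mb_s(mt_per_s):
    """Theoretical peak bandwidth of one channel in MB/s."""
    return mt_per_s * BUS_WIDTH_BYTES


print(ddr_clock_mhz(3200))           # 1600.0
print(channel_bandwidth_mb_s(3200))  # 25600
```

The second number is the one that appears in PC4-25600 style markings.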

And at some point we start using up the entire bandwidth, and performance becomes limited by memory speed.

That’s why multi-channel memory controllers were invented: you can access several memory modules simultaneously and independently of each other. This ultimately increases the theoretical bandwidth N times (N is the number of channels), and for a two-processor system 2N times. Theoretical, because in reality everything depends on how the data is spread across the physical memory modules.

And here comes the question: cool 6000 memory modules are on sale, so why not fill the server with them?

There are two points here:

  • Cool 6000-class modules are NON-REGISTERED (unbuffered) memory for overclocking enthusiasts. If it suddenly glitches, that is no big deal for them.

  • Server memory is registered, i.e. it has an additional chip (a register) that makes it possible to install a large amount of memory (more modules per channel).

In short, depending on the class of the server, such memory either cannot be installed at all or should not be (as a recommendation).

Why is there such a fundamental difference between the desktop and server worlds?

Desktop processors are processors for single-processor machines, with a maximum of two memory channels (in the mass-market segment). Once you have used up both channels, all that remains is to raise the frequency.

Server processors solve the memory bandwidth problem with multiprocessor configurations and multi-channel (more than 2 channels) memory. For example, Xeon Scalable v2 has 6-channel memory, and this is why memory sizes come in multiples of 6 modules: 192, 384, 768 GB.
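Where the 192 / 384 / 768 figures come from can be shown with a quick calculation. This is a sketch under assumptions that are not in the text above: a two-socket server, one module per channel, and module sizes of 16, 32 and 64 GB.

```python
def populated_capacity_gb(sockets, channels, module_gb, dimms_per_channel=1):
    """Total RAM when every channel of every socket gets identical modules."""
    return sockets * channels * dimms_per_channel * module_gb


# Assumed example: 2 sockets x 6 channels, one module per channel.
for module_gb in (16, 32, 64):
    total = populated_capacity_gb(sockets=2, channels=6, module_gb=module_gb)
    print(f"12 x {module_gb} GB = {total} GB")
```

Filling all channels evenly is what lets the controller use its full theoretical bandwidth, which is why "round" sizes like 256 or 512 GB fit a 6-channel platform poorly.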

Desktop: 1 * 2 * DDR5-5600 = 11200 MT/s (Core i9)
Server: 2 * 6 * DDR4-3200 = 38400 MT/s (Xeon Scalable v2)
Server: 2 * 8 * DDR5-4000 = 64000 MT/s (Xeon Scalable v4)
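The three lines above are simply sockets times channels times module speed. A small sketch reproduces them and converts the result to GB/s, assuming 8 bytes per transfer per channel (that conversion is an addition, not part of the original comparison; the names are made up for this illustration).

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit data bus per channel

# (label, sockets, channels per socket, MT/s), figures taken from the comparison above
CONFIGS = [
    ("Desktop, Core i9", 1, 2, 5600),
    ("Server, Xeon Scalable v2", 2, 6, 3200),
    ("Server, Xeon Scalable v4", 2, 8, 4000),
]


def aggregate_mt_s(sockets, channels, mt_per_s):
    """Sum of transfer rates across all channels of all sockets."""
    return sockets * channels * mt_per_s


def aggregate_gb_s(sockets, channels, mt_per_s):
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    return aggregate_mt_s(sockets, channels, mt_per_s) * BYTES_PER_TRANSFER / 1000


desktop = aggregate_mt_s(*CONFIGS[0][1:])
for label, sockets, channels, speed in CONFIGS:
    total = aggregate_mt_s(sockets, channels, speed)
    print(f"{label}: {total} MT/s, {aggregate_gb_s(sockets, channels, speed):.1f} GB/s, "
          f"{total / desktop:.2f}x desktop")
```

These are theoretical peaks; as noted earlier, the real figure depends on how the data is spread across the modules.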

You may object: what about AMD Ryzen Threadripper? It has eight memory channels. (And Intel had the Core X series with 4 channels.)
Yes, you are right. But look at its price: it costs more than some Xeons, and this is what makes it an extremely niche product, competing for a place with workstations built on server processors.

* From the module marking: DDR4-3200 may also be labeled PC4-25600, where 25600 is the bandwidth in MB/s.

Choosing a processor

The question arose here in the chat – which processor to take, Xeon 5317 or 6330? Let’s use this example to analyze what to look for and what is important.

To keep the task simple, let’s take it as a given that our server is large, at least 2U, and there are no problems with either power or cooling. Without going into why this particular pair was chosen, let’s take a look at it.

Both processors belong to the same generation, Xeon Scalable v3.

5317 – 12 cores at 3 GHz, 3.6 turbo, 18 MB L3
6330 – 28 cores at 2 GHz, 3.1 turbo, 42 MB L3

The easiest head-on comparison is cores times base frequency: 12 * 3 = 36 GHz in total against 28 * 2 = 56 GHz (we assume all hyper-threading coefficients are the same and simply leave them out). It would seem, what else is there to compare?

And this is where you need to understand how the processor, the OS kernel, the hypervisor scheduler and the application software actually work.
Purely theoretically, the 6330 beats the 5317 by 56/36 = 1.56 times, but to realize this advantage you have to give it an appropriate load: a set of loosely coupled threads capable of keeping that huge number of cores busy. And loosely coupled is the key point, because as the coupling and interdependence of threads grows, useful work time gets fragmented and synchronization overhead kicks in. In other words, this is a processor for a mass of VDI machines with 2-4 vCPUs or a pile of containers, and really a pile, with a vCPU/pCore ratio of 5 and higher.
If you move from many small machines to a smaller number of larger machines, with an overall decrease in vCPU/pCore, sooner or later the low per-core frequency comes into play.
The maximum performance of a VM with 6 vCPUs on the 5317 is 6 * 3 = 18 GHz. To get the same from the 6330 you would, purely hypothetically, have to give the VM 9 vCPUs, but as we understand, an even load across all cores is practically impossible unless it is a compute farm. In real life the load often falls on 1-2 cores. That is, in practice it is not 18 GHz but 6 against 4 (two cores at base frequency). Or, taking turbo into account, 7.2 against 6.2. And, interestingly, it is much harder to hit turbo on a 28-core processor, so realistically it is 7.2 against 4-5.
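The "GHz budget" arithmetic from this comparison fits into a short sketch. The core counts and frequencies are the ones quoted above; the helper names are made up for this illustration, and summing GHz across cores is the same rough yardstick the text uses, not a benchmark.

```python
CPUS = {
    "5317": {"cores": 12, "base_ghz": 3.0, "turbo_ghz": 3.6},
    "6330": {"cores": 28, "base_ghz": 2.0, "turbo_ghz": 3.1},
}


def total_ghz(cpu, busy_cores=None, turbo=False):
    """Sum of core frequencies over the busy cores (all cores by default)."""
    cores = cpu["cores"] if busy_cores is None else min(busy_cores, cpu["cores"])
    freq = cpu["turbo_ghz"] if turbo else cpu["base_ghz"]
    return cores * freq


def vcpus_to_match(cpu, target_ghz):
    """How many vCPUs at base frequency are needed to reach target_ghz."""
    return target_ghz / cpu["base_ghz"]


for name, cpu in CPUS.items():
    print(name,
          "all cores:", total_ghz(cpu),
          "| 2 busy cores:", total_ghz(cpu, busy_cores=2),
          "| 2 busy cores, turbo:", total_ghz(cpu, busy_cores=2, turbo=True))

vm_ghz = total_ghz(CPUS["5317"], busy_cores=6)
print("6 vCPU VM on 5317:", vm_ghz, "GHz; vCPUs needed on 6330:", vcpus_to_match(CPUS["6330"], vm_ghz))
```

The gap flips depending on how many cores the workload can really keep busy, which is the whole point of the comparison.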

The 6330 has about 2.3 times more L3 cache (42 MB against 18 MB), no argument: an extremely useful thing to have, all else being equal.

Now let’s try to find out if there are any current load statistics to predict demand.
The author of the question gave the following statistics: a server with 2 * Xeon 4210 (10 cores * 2.0 GHz, 3.2 turbo) and 768 GB of memory is loaded to 22% on CPU and 45% on memory. Roughly, that gives 1 GHz per 32 GB.
With this load profile, we can predict balanced configurations for the 5317 and 6330: 2.3 TB and 3.5 TB of RAM, respectively.
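The balanced-configuration estimate can be reproduced as follows. This is a sketch assuming two-socket servers and the rounded ratio of 32 GB per GHz from the text; computed directly from the quoted load figures, the ratio comes out somewhat higher (about 39 GB per GHz), so treat the results as a rough guide. The function names are made up for this illustration.

```python
GB_PER_GHZ = 32  # rounded ratio used in the text


def observed_gb_per_ghz(sockets, cores, base_ghz, cpu_load, ram_gb, ram_load):
    """GB of RAM actually in use per GHz of CPU actually in use."""
    used_ghz = sockets * cores * base_ghz * cpu_load
    used_gb = ram_gb * ram_load
    return used_gb / used_ghz


def balanced_memory_gb(sockets, cores, base_ghz, gb_per_ghz=GB_PER_GHZ):
    """RAM at which CPU and memory would run out at the same time."""
    return sockets * cores * base_ghz * gb_per_ghz


# Current server: 2 x Xeon 4210, 768 GB, 22% CPU load and 45% memory load.
print(round(observed_gb_per_ghz(2, 10, 2.0, 0.22, 768, 0.45), 1))

# Candidate servers, two sockets each (assumption).
print("5317:", balanced_memory_gb(2, 12, 3.0), "GB")
print("6330:", balanced_memory_gb(2, 28, 2.0), "GB")
```

The results, 2304 GB and 3584 GB, are the 2.3 TB and 3.5 TB quoted above.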

That is, in practice, even with 2 TB of RAM in the server, the 6330 cannot show its potential power and advantage over the 5317. The 5317, on the other hand, not only covers the current 1 GHz / 32 GB ratio, but also gives the same machines better performance thanks to its higher per-core frequency.
