How to squeeze 1.5 teraflops of 32-bit floating-point performance out of a single M1 processor core

Short description

A blog post explores the performance of Apple’s matrix multiplication instruction set (its Apple Matrix Coprocessor, AMX), found in Apple’s recent chips, including the M1 that powers the latest MacBook models. The article discusses the X, Y, and Z register groups, the sizes of values that can be loaded or stored, and how to write matrix multiplication code using AMX instructions. A straightforward first attempt on AMX yields only a few hundred gigaflops. Using a method that loads 128 bytes into the X and Y registers and computes a 32×32 block, 1.5 teraflops can be achieved.

How to squeeze 1.5 teraflops of 32-bit floating-point performance out of a single M1 processor core

If you work on and study large modern neural networks, this article will not be relevant to you, because the A100 is about a hundred times faster (156 teraflops).

So what is interesting about these one and a half teraflops?

  • it is achieved on a single core of a 2020 MacBook Air running on battery power;
  • the instructions execute with a latency of roughly 0.5 nanoseconds.

We are not talking about powerful accelerators or GPU tensor cores, but about real linear algebra performance that sits just one cycle away from the processor’s registers.

Strangely enough, Apple hides this thing from us! But in this article, we will look at code that lifts the veil of secrecy. All of the code uses the aarch.h header from the excellent corsix repository.
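
Here is a minimal sketch of how that header is used, assuming the AMX_SET and AMX_CLR macros it provides; everything else in this post happens between those two calls:

#include "aarch.h"  // AMX_* macros from the corsix repo

void amx_session(void) {
  AMX_SET();  // enable the AMX unit for this thread
  // ... AMX_LDX / AMX_LDY / AMX_FMA32 / AMX_STZ calls go here ...
  AMX_CLR();  // disable it again when done
}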

What is an AMX coprocessor?

It’s basically SIMD on steroids. An important feature is that the AMX:CPU ratio is not 1:1; not every core has its own AMX coprocessor.

Here are the sizes you can use to load or store values: 64 bytes per instruction, or 128 bytes when a flag is set.

Even the minimum is the width of a full AVX-512 register.
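
As a sketch, using the AMX_LDX load macro from the header mentioned above (the bit-62 flag is the same one the final code in this post uses to request a 128-byte transfer):

// illustration only
void load_width_examples(const float *src) {
  AMX_LDX((uint64_t)src);                 // 64 bytes into one X register
  AMX_LDX((1ull << 62) | (uint64_t)src);  // 128 bytes into two consecutive X registers
}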

But where do these values get loaded from or stored to? Sizes like these would fill up the NEON register file pretty quickly. Instead, there is a separate register file just for AMX, and it looks rather unusual.

The registers are divided into groups X, Y, and Z. Groups X and Y hold the input values of each instruction, and group Z holds the output values.

Groups X and Y are by no means small: together they take up a whole kilobyte. And Z is in a different league entirely:

Spoiler: 1024 bytes (a quarter of the Z register group) can be completely filled by a single AMX instruction.
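
As a mental model, the whole register file can be pictured as the struct below. The 64-byte row width and the row counts (8 for X and Y, 64 for Z) are inferred from the sizes above and from the register indices used in the code later in this post; this is a conceptual sketch, not a hardware definition.

#include <stdint.h>

// conceptual layout of the AMX register file
struct amx_register_file {
  uint8_t x[8][64];   //  8 rows of 64 bytes =  512 bytes
  uint8_t y[8][64];   //  8 rows of 64 bytes =  512 bytes
  uint8_t z[64][64];  // 64 rows of 64 bytes = 4096 bytes
};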

And how do we get from X and Y to Z? There are so many ways that they barely fit into the ISA encoding, so most of the information for Apple’s instructions is encoded in a general-purpose register. This turns out to be pleasant to work with, because it lets you configure the AMX code on the fly, while it is running.

The goal of this post is to squeeze maximum performance out of the coprocessor. There are vector-by-vector multiplication instructions that output vectors of the same length, but they do not exhaust the computing capabilities of this chip. To get the real effect, we have to use outer products instead.

What is an outer product? Take two vectors u and v; then:

The outer product of these vectors is a matrix containing the products of all possible pairs of their elements: element (m, n) is u[m] * v[n]. This goes some way toward explaining why the Z register group is so much larger than the X and Y register groups.
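
In plain C, the 32-bit outer product that AMX computes looks like this: 16 floats from u and 16 floats from v produce a full 16×16 matrix.

void outer_product_16x16(const float *u, const float *v, float out[16][16]) {
  for (int m = 0; m < 16; ++m) {
    for (int n = 0; n < 16; ++n) {
      out[m][n] = u[m] * v[n];
    }
  }
}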

In the AMX chip, this boils down to one very simple instruction:
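
With the macros used throughout this post, that is a single call (all operand bits left at zero):

AMX_FMA32(0);  // Z += outer product of X register 0 and Y register 0, as a 16x16 block of 32-bit floats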

There is also a flag that controls accumulation: when it is clear, the result is added to whatever is already in Z; when it is set, Z is overwritten (the code below calls it reset_z):
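
Concretely, this is the bit the code below names reset_z:

const uint64_t reset_z = 1ull << 27;
AMX_FMA32(reset_z);  // Z  = outer product of X[0] and Y[0] (overwrite)
AMX_FMA32(0);        // Z += outer product of X[0] and Y[0] (accumulate)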

So we have everything we need to write a matrix multiplication: repeatedly load 16 floating-point values from each of the input matrices and accumulate their outer products into the 16×16 output. And the size of the reduction dimension K does not matter at all!

Let’s simplify the task and implicitly transpose the matrix multiplication: both A and B (our inputs) will have the reduction dimension K as their leading dimension. In practice this changes little, but it simplifies the code a lot.

Here is a reference implementation to check the proposed solution against:

// computes C[n * 16 + m] = sum over k of A[k * 16 + m] * B[k * 16 + n]
void reference_16x16xK(float *A, float *B, float *C, uint64_t K) {
  for (uint32_t m = 0; m < 16; ++m) {
    for (uint32_t n = 0; n < 16; ++n) {
      C[n * 16 + m] = 0;
      for (uint32_t k = 0; k < K; ++k) {
        C[n * 16 + m] += A[k * 16 + m] * B[k * 16 + n];
      }
    }
  }
}

And here is how to do the same thing with AMX:

// only set for k == 0
uint64_t reset_z = 1ull << 27;

for (uint32_t k = 0; k < K; ++k) {
  // 64 bytes = 16 floats
  AMX_LDX((uint64_t)A + k * 64);
  AMX_LDY((uint64_t)B + k * 64);

  // accumulate one outer product into Z (Z is reset on the first iteration)
  AMX_FMA32(reset_z);
  reset_z = 0;
}

It is worth noting that we never referred to any register by its address. Or rather, we did, but implicitly. Just as reset_z is encoded as a bit mask, register addresses are encoded in the arguments passed to the AMX_* instructions. The pointers to A and B only need 56 bits, so Apple’s engineers hid extra data in the remaining eight bits. We simply left those bits at 0, so register 0 is used for both X and Y.
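
Putting it together, a load operand is assembled like this (bit positions taken from the code elsewhere in this post):

uint64_t address   = (uint64_t)A;  // low bits: pointer to the data
uint64_t reg_index = 2ull << 56;   // bits 56 and up: which register in the group to use
uint64_t two_regs  = 1ull << 62;   // bit 62: transfer 128 bytes instead of 64
AMX_LDX(two_regs | reg_index | address);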

The code for storing the Z registers to memory is a little more involved: only one column of the Z register group is populated, which means we have to read registers 0, 4, 8, and so on.

for (uint64_t i = 0; i < 16; ++i) {
  // row i of the 16x16 result lives in Z register 4 * i
  const uint64_t z_register = (i * 4ull) << 56;
  AMX_STZ(z_register | (uint64_t)C + i * 64);
}

Unfortunately, if we run the code above, we will find that it is terribly slow: only a few hundred gigaflops. Why? There are two reasons.

The first reason is a pipeline hazard.

A pipeline hazard is the general term for a situation in which one instruction needs the result of a previous instruction before that instruction has finished executing.

Each AMX_FMA32 instruction depends on the previous one, because we accumulate into a single subset of the register file. As a result, we run 25% of the register file at full tilt and leave the rest idle, which rules out any instruction-level parallelism.
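
The fix, used in the final version below, is to aim consecutive FMAs at different Z accumulators (the field at bit 20 of the operand) so they no longer depend on each other:

AMX_FMA32(0ull << 20);  // writes one set of Z rows
AMX_FMA32(0ull << 20);  // same Z rows as above: must wait for the previous result
AMX_FMA32(1ull << 20);  // different Z rows: independent, can overlap in the pipeline
AMX_FMA32(2ull << 20);  // likewise independent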

The second reason is inefficient loading from memory. It is possible to load 128 bytes at a time, but the code above loads only 64. Likewise, we can load into different registers instead of the same ones every time, which also buys a little instruction-level parallelism.

And what is our plan?

We will load 128 bytes into X and Y and then compute a 32×32 block. That takes 4 independent 16×16 block computations, which gives us instruction-level parallelism and makes more efficient use of the loaded memory (each 64-byte register is used twice).

void mm32x32xK(float* A, float* B, float* C, uint64_t K) {

  // flag to load/store 128 bytes
  const uint64_t load_store_2 = 1ull << 62;
  const uint64_t load_store_width = 128; // in bytes

  // only set for k == 0
  uint64_t reset_z = 1ull << 27;

  for (uint32_t k = 0; k < K; ++k) {
    uint64_t idx = k % 4;
    // load to X, Y (skipping every other index because we're loading 128 bytes)
    AMX_LDX(load_store_2 | (idx * 2) << 56 | (uint64_t)A + k * load_store_width);
    AMX_LDY(load_store_2 | (idx * 2) << 56 | (uint64_t)B + k * load_store_width);

    // offset into X and Y registers is byte-wise
    const uint64_t offset = idx * load_store_width;

    // now we do 4 independent outer products (avoiding pipeline hazards)
    AMX_FMA32(reset_z | (0ull << 20) | ((offset +  0ull) << 10) | ((offset +  0ull) << 0));
    AMX_FMA32(reset_z | (1ull << 20) | ((offset + 64ull) << 10) | ((offset +  0ull) << 0));
    AMX_FMA32(reset_z | (2ull << 20) | ((offset +  0ull) << 10) | ((offset + 64ull) << 0));
    AMX_FMA32(reset_z | (3ull << 20) | ((offset + 64ull) << 10) | ((offset + 64ull) << 0));
    reset_z = 0;
  }

  for (uint64_t i = 0; i < 16; ++i) {
    // store interleaved
    AMX_STZ(load_store_2 | ((i * 4ull + 0) << 56) | (uint64_t)C + i * load_store_width);
    AMX_STZ(load_store_2 | ((i * 4ull + 2) << 56) | (uint64_t)C + (16 + i) * load_store_width);
  }
}
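
A minimal harness for trying this out might look like the following. The sizes, fill values, and the AMX_SET/AMX_CLR bracketing here are illustrative additions, not part of the code above:

#include <stdint.h>
#include <stdlib.h>
#include "aarch.h"  // AMX_* macros from the corsix repo

int main(void) {
  const uint64_t K = 1024;
  // A and B are K x 32 with K as the leading dimension; C is 32 x 32
  float *A = aligned_alloc(128, K * 32 * sizeof(float));
  float *B = aligned_alloc(128, K * 32 * sizeof(float));
  float *C = aligned_alloc(128, 32 * 32 * sizeof(float));
  for (uint64_t i = 0; i < K * 32; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

  AMX_SET();  // enable the coprocessor before issuing AMX instructions
  mm32x32xK(A, B, C, K);
  AMX_CLR();  // and disable it afterwards

  // every element of C should now equal 2 * K
  free(A);
  free(B);
  free(C);
  return 0;
}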

I’ve added comments above, but there are a few more interesting bits in the instruction flags. Corsix does a great job of explaining everything, so I’ll just leave a link to the explanation:

How fast does this get us? Well, it depends on the size of K, but at larger values we reach 1.5 teraflops.

Not surprisingly, larger problems achieve better relative performance, because the cache has more opportunity to warm up and the processor has more instructions in flight to keep its pipelines busy.

Of course, compared with the large modern neural networks chasing general artificial intelligence, these dimensions are simply microscopic. However, this kind of performance opens the door to smaller neural networks that could find a place in real-world computing. If a prediction can be made on a battery-powered laptop in a few tens of nanoseconds, it is likely to be worth more than whatever value we could add in places where heuristics would otherwise be applied.

In other words, in the author’s opinion, such a prediction is worth more than what could be achieved with heuristics.

And what do you think about it? Thank you for your attention!
