About one meta-optimization

About one meta-optimization

For a bright title, the term “super-optimization” or even “hyper-optimization” should be used here. But the prefixes “super” to everything in the world are so far-fetched that, for example, the completely normal and even scientific term “super-programming” has become more associated with the achievements of some unknown “super-programmers” and not with the methods of converting programs. Folk art, on the other hand, defines “super-programming” as programming while cooking soup. Therefore, I will not use the boastful “super”, but only the humble “goal”.

Meta-optimization (or, more simply, over-optimization) is optimization applied to other optimization methods. I really like to invent all sorts of optimizations in the compiler I support. But it somehow did not occur to me to apply meta-optimization. An accident happened.

The compiler I have the honor of overseeing has a place where x86-64 registers are allocated for addresses: pointers, function parameters, etc. In terms of language for based variables

When allocating registers, the compiler first tries to find the most profitable register, for example, which will immediately be used again for the same purpose and therefore needs to be reloaded with the same value. If no profitable ones are found, the first free one is selected. There is nothing special or complicated here. Apparently, the only question is: which of the free registers to allocate first – one of the “traditional” ones for x86 type RBX/RSI/RDIor one of the “new” ones in x86-64 ie. R8-R15?

On the one hand, R8-R15 are less often used, for example, in the system library (it is redesigned for x86-64 from x86), and, therefore, there is a greater chance that their values ​​​​are often not erased during the execution of the program and will require repeated downloads. But on the other hand, use R8-R15 requires adding REX-prefixes to commands, which increases the amount of code.

For example, a team PUSH RAX takes one byte, but it’s about the same PUSH R8 – two Team INC B PTR [RBX] does not require a REX-prefix, a INC B PTR [R10] – Requires, etc. Once I tried to allocate registers first from free ones in the compiler R8-R15 and immediately saw that the code was simply bloated without any visible advantages. But he did not continue to experiment.

Digressing a little to the side. With the advent of x86-64, a funny question arose. There is a command in x86 MOV EAX, EBX. You can get the same result using a couple of commands PUSH EBX + POP EAX. But there is no reason to believe that this pair will work better than a normal shipment. In pairs and commands are dependent on each other through the stack and there is writing and reading from memory. And for x86-64, all these considerations, of course, remain. But there is a nuance. In the forwarding team MOV RAX, RBX now a REX prefix is ​​required and takes three bytes. And in a pair of forwarding through the stack PUSH RBX + POP RAX no REX prefix is ​​needed and the pair takes only two bytes instead of three. That is, there was also an argument in favor of forwarding through the stack – code reduction. How much this advantage outweighs (and does it outweigh?) ​​the disadvantages is, of course, a question.

But let’s return to optimization. Recently, for the purpose of debugging and for a completely different reason, I again set a preference in the compiler when selecting from free for R8-R15. And I was surprised to see that the size of the code (that is, the commands) of the program, which consists of many modules, even slightly decreased compared to the predominant selection of “traditional” registers. It turns out that part of the modules increased, but, therefore, the other part still decreased and so that it overpowered the increase in the code of others!

That’s when the idea came to apply meta-optimization, using the entire compiler as a lower-level optimizer. Compile each module several times, selecting registers one way or another. Then, with a separate simplest program, compare the size of the code in all variants of the received object modules. And choose the option with the shortest code.

I’ve highlighted only 16 code-affecting priority settings when allocating registers. In general, since there are 16 general-purpose registers, and the register RSP if we do not highlight, there should be 32768 different combinations. But 99.99% of these options will not change. And so, each bit of the variant number means that the specified address register should not be allocated, even if it is free. The first bit is the register RBXsecond – RSIthird – RDI and the fourth RCX. For example, option #3 means that no case should be allocated for addressing RBXno case RSIeven if they are not currently occupied by any values. It looks like Holmes’ instructions to Dr. Watson. Do you remember? “Don’t get into either the first or the second taxi that comes your way.” What this will result in the program and how it will affect the size of the teams is difficult to predict.

But it is easier to insert this condition into the compiler. After all, he still checks whether the register is currently occupied or not. And yes, it checks whether it is busy and at the same time sets the corresponding bit in the variable. If the bit is set, it doesn’t matter that the register is busy.

So, 16 versions of compilations. In order not to change the scripts of the automatically created program assemblies, the number of the compilation option is not set through the command line of the compiler call, but through the Windows registry, where one variable is entered, which the compiler reads when it starts. Of course, if there is no such variable in the registry, then the null option is the default.

A script with a list of all compiled modules is also created automatically, and then run inside the “main” script of the general assembly. Only this place requires changes to turn into an enumeration of all build options.

Below is a snippet of that script. The best options are stored in files with the .OOO extension.

Everything is performed by the script procedure REG_ADD, which first creates a basic set of object modules, and then repeats the compilation of all modules 15 more times and selects the best samples from the next version 15 times with the help of a simple program specially written for this purpose, CODE_OBJ.EXE.

And the actual compilation of all modules continues to be performed by the CALL ML.BAT command, after which 40 object modules are created.

Сначала всякие подготовительные операции
…
REM ---- СОЗДАЕМ ПЕРВЫЕ ВАРИАНТЫ *.OBJ ----
DEL *.OOO
CALL :REG_ADD 0
COPY *.OBJ *.OOO
REM ---- СОЗДАЕМ ОСТАЛЬНЫЕ ВАРИАНТЫ *.OBJ, ЛУЧШИЕ СОХРАНЯЕМ В *.OOO ----
FOR /L %%J IN (1,1,15) DO CALL :REG_ADD %%J
REM ---- ЛУЧШИЕ ВАРИАНТЫ *.OBJ ВОЗВРАЩАЕМ ПО МЕСТАМ ----
COPY *.OOO *.OBJ
…
Далее выполняется штатная сборка EXE-файла, но уже из лучших модулей

Here is the text of the script procedure:

REM ---- ИЗ РАЗНЫХ ВАРИАНТОВ *.OBJ СОХРАНЯЕМ В *.OOO ЛУЧШИЕ ----
:REG_ADD
@ECHO .
@ECHO %1
@REG ADD HKCC\Software\PL/1 /v RBX /t REG_DWORD /d 0000000%1 /f >NUL
@DEL *.OBJ 
@CALL ML.BAT
@IF "%1" == "0" GOTO :EOF
@REM ---- СОХРАНЯЕМ В *.OOO ЛУЧШИЕ ВАРИАНТЫ *.OBJ ----
@ECHO OFF
@FOR %%I IN (*.OOO) DO CODE_OBJ.EXE %%I %%~NI.OBJ
@GOTO :EOF

Thus, the completions of the compiler and the EXE file assembly script are the most primitive. A simple program for selecting the shortest version of the object module is attached. Here is the text of this program, which compares the command sizes of two OMF files and saves the shorter one in a file with the .OOO extension.

I do not propose to understand this text, but simply show how small it is:

CODE_OBJ:PROC(S) MAIN; 
DCL 
 S                CHAR(*) VAR,
 F1               FILE,
 P1               PTR,
 X(0:100_000_000) BIT(8) BASED(P1),
(I,J,K)           FIXED(31),
 B32              BIT(32) DEF(J),
 B8               BIT(8)  DEF(J),
 B8_1             BIT(8)  DEF(J+1),
 B8_2             BIT(8)  DEF(J+2),
 B8_3             BIT(8)  DEF(J+3),
 ЗАГОЛОВОК        BIT(8),
(ИМЯ_1,ИМЯ_2)     CHAR(*) VAR, 
(COPYFILEA        ENTRY(CHAR(*) VAR,CHAR(*)VAR,BIT(32)),
 GETLASTERROR)    RETURNS(FIXED(31)) IMPORT;

//---- ЧИТАЕМ ДВА ИМЕНИ ФАЙЛОВ ЧЕРЕЗ ПРОБЕЛ ----
GET STRING(S) LIST(ИМЯ_1,ИМЯ_2);
//---- ОТКРЫВАЕМ ПЕРВЫЙ OBJ - ФАЙЛ (ЛУЧШИЙ) ----
ON UNDEFINEDFILE(F1) PUT SKIP LIST('НЕТ',ИМЯ_1,STOP(0));
OPEN(F1) INPUT RECORD TITLE(ИМЯ_1) ENV(B(-1));// ОТОБРАЖЕНИЕМ НА ПАМЯТЬ
READ(F1) SET(P1);
REVERT UNDEFINEDFILE(F1);
//---- ЕСЛИ НЕТ OMF-ЗАГОЛОВКА ----
IF X(0) ^= '80'B4 THEN STOP(0);
//---- ПРОПУСКАЕМ ВСЕ ЗАПИСИ ДО ПЕРВОГО SEGDEF-32 ----
DO WHILE(ЗАГОЛОВОК^='99'B4);
   J=0; B8=X(I+1); B8_1=X(I+2);     // ДОСТАЛИ ДЛИНУ
   I+=J+3;                          // ПРОПУСТИЛИ ЗАПИСЬ
   ЗАГОЛОВОК=X(I);                  // ЗАГОЛОВОК СЛЕДУЮЩЕЙ ЗАПИСИ
   IF ЗАГОЛОВОК='98'B4 THEN STOP(0);// ЭТО БЫЛ OMF ОТ АССЕМБЛЕРА	
END WHILE;
//---- ДОСТАЕМ ДЛИНУ СЕГМЕНТА SEGDEF-32 ----
B8  =X(I+4);
B8_1=X(I+5);
B8_2=X(I+6);
B8_3=X(I+7);
K=J; // ЗАПОМНИЛИ ДЛИНУ ДЛЯ БУДУЩЕГО СРАВНЕНИЯ
CLOSE(F1);
//---- ОТКРЫВАЕМ ВТОРОЙ OBJ - ФАЙЛ (ТЕКУЩИЙ) ----
ON UNDEFINEDFILE(F1) PUT SKIP LIST('НЕТ',ИМЯ_2,STOP(0));
OPEN(F1) INPUT RECORD TITLE(ИМЯ_2) ENV(B(-1));// ОТОБРАЖЕНИЕМ НА ПАМЯТЬ
READ(F1) SET(P1);
I=0;
ЗАГОЛОВОК='00'B4;    // ПОКА ОПЯТЬ НЕТ OMF-ЗАГОЛОВКОВ
//---- ЕСЛИ НЕТ ЗАГОЛОВКА OMF-ФАЙЛА ----
IF X(0) ^='80'B4 THEN STOP(0);
//---- ПРОПУСКАЕМ ВСЕ ЗАПИСИ ДО ПЕРВОГО SEGDEF-32 ----
DO WHILE(ЗАГОЛОВОК^='99'B4);
   J=0; B8=X(I+1); B8_1=X(I+2);     // ДОСТАЛИ ДЛИНУ
   I+=J+3;                          // ПРОПУСТИЛИ ЗАПИСЬ
   ЗАГОЛОВОК=X(I);                  // ЗАГОЛОВОК СЛЕДУЮЩЕЙ ЗАПИСИ
   IF ЗАГОЛОВОК='98'B4 THEN STOP(0);// ЕСЛИ ЭТО БЫЛ OMF ОТ АССЕМБЛЕРА	
END WHILE;
//---- ДОСТАЕМ ДЛИНУ СЕГМЕНТА ----
B8  =X(I+4);
B8_1=X(I+5);
B8_2=X(I+6);
B8_3=X(I+7);
CLOSE(F1);
//---- ТЕКУЩИЙ OBJ-ФАЙЛ БОЛЬШЕ ПО КОДАМ, ЧЕМ ЛУЧШИЙ ----
IF K<=J THEN STOP(1);  // НИЧЕГО НЕ НАДО ДЕЛАТЬ
//---- ТЕКУЩИЙ OBJ-ФАЙЛ КОРОЧЕ ЛУЧШЕГО, КОПИРУЕМ ЕГО КАК ЛУЧШИЙ ----
IF COPYFILEA(ИМЯ_2,ИМЯ_1,'00000000'B4)^=0 THEN 
  PUT SKIP EDIT('СКОПИРОВАН ЛУЧШИЙ ',ИМЯ_2,J,' БАЙТ В ')(A,A,COL(35),F(8),A)
                                    (ИМЯ_1,K,' БАЙТ')   (  A,COL(65),F(8),A);
  ELSE PUT SKIP LIST('ОШИБКА',GETLASTERROR,STOP(0));
STOP(1);
END CODE_OBJ;

First, I simply tried to compile the application 16 times for different case allocation options and wrote 16 sizes of EXE code. It can be seen that any attempt to prevent free registers from being allocated RBX/RCX/RSI/RDI more often leads to increased code size. And for two variants, it was not possible to compile one of the 40 modules at all, because invalid type commands were output MOV AH,[R10] due to the “lack” of traditional registers.

And here is the result of running the script with 16 compilations of 40 modules and choosing the shortest one.

A fragment of the CODE_OBJ.EXE listing. It can be seen how there are increasingly short versions of object modules:

A fragment of the CODE_OBJ.EXE listing. It can be seen how there are increasingly short versions of object modules:

As a result, an EXE file with the size of the commands was compiled 15B76B, which is shorter than the best option above. So to speak, a “collective” of modules from all “club” modules came out. The analogy with sports teams was also revealed in the fact that almost all “teams” (that is, almost all options) delegated their “best players” to the “team”, despite the overall average level of each “team” (most modules did not decrease in size, and increased).

So, since the compiler found some kind of “bifurcation points” that slightly, but change the code generation in an unpredictable way in terms of the total size of the commands, it was possible to apply “external” optimization to the compilation results of each module. And since the compiler performs its optimizations every time, the result is optimization of optimization or meta-optimization.

As if it would be possible to shove all these problems inside the compiler and look for the best option there. But for me, the advantage of this implementation is precisely that there is no need to change the compiler, increasing the risk of errors. The addition of four bit check commands should not be considered a major change.

It can be said that the results of this meta-optimization turned out to be weak. In a program consisting of 1.4 Mb of commands, it was possible to reduce this way to less than 2 Kb. However, let alone this penny begins to protect the kopeck. After all, the winnings came almost for nothing. In just half an hour of writing a program for choosing a shorter object module. Even the compilation time, despite 16 options, slowed down not critically: it was 3 seconds, 48 ​​seconds. Don’t worry, a new assembly is not performed every minute, this is a rare event.

A more interesting question is: what was the reason for the reduction of the code at all? If in most cases the refusal to use register for addressing RSI or RDI increases the code, or not decreases it?

It is due to the application of each approach to allocation of registers to each specific module. Each compiled module (its source text) becomes a kind of compiler setting, because the inverse appears in the form of an estimate of the code size. For most modules there is no gain, but for one particular module there may be.

Usually, the role of feedback is played by a programmer who analyzes the results of the program and tries to find the best option using the compiler keys.

Here, although in the simplest form and according to only one parameter (the size of commands), this feedback is automated.

In general, if there is a clear idea of ​​optimization during compilation, of course, it should be included directly in the compiler without using any meta-optimizations. However, if, as in this case, the merits of different register allocation options appear vague, and the analysis of this part in the compiler itself seems overwhelming, it may be appropriate to simply go to a higher level and use the compiler as an integral element in the more general process of finding a better solution.

Related posts