Fast parsing of 8-bit integers

Fast parsing of 8-bit integers

Let’s say you need to quickly parse 8-bit integers (0, 1, 2, …, 254, 255) from an ASCII/UTF-8 string. The task is taken from the simdzone project led by Jeroen Koekkoek (NLnet Labs). Given a string and its length: for example ’22’ and length 2. A naive solution in C might look like this:

int parse_uint8_naive(const char *str, size_t len, uint8_t *num) {
  uint32_t n = 0;
  for (size_t i = 0, r = len & 0x3; i < r; i++) {
    uint8_t d = (uint8_t)(str[i] - '0');
    if (d > 9)
     return 0;
    n = n * 10 + d;
  }
  *num = (uint8_t)n;
  return n < 256 && len && len < 4;
}

This code is a C function that takes as input arguments a character string, its length, and a pointer to an 8-bit unsigned integer. The function returns a Boolean value that determines whether the input string can be parsed into an 8-bit unsigned integer or not. It limits input to a maximum of three characters, but allows leading zeros (for example, 002 is a 2).

First, the function initializes a 32-bit unsigned integer n, setting it to 0, we will store the answer. The function then iteratively loops through the input string, extracting each numeric character from the string and converting it to an unsigned 8-bit integer. The subtracted number is then multiplied by 10 and added to n. This process continues until the end of the line or until the function has processed 4 bytes of the line. Finally, the function returns a value n to the unsigned 8-bit integer it points to num. It then returns a boolean indicating whether the paired integer is less than 256 and the length of the input string (1 to 3 characters). If the input string contains any non-numeric characters or if the string length is greater than 3 bytes, then the function returns false.

If the length of the input data is predictable, this naive function should run fast. However, if the length varies (from 1 to 3), then the processor will be prone to incorrect branching predictions and costly computing resources.

C++ can be called from_chars:

int parse_uint8_fromchars(const char *str, size_t len, uint8_t *num) {
  auto [p, ec] = std::from_chars(str, str + len, *num);
  return (ec == std::errc());
}

Function std::from_chars receives three arguments: a pointer to the beginning of the input sequence of characters, a pointer to its end, and a reference to the output variable in which the parsed integer value will be stored. The function returns an object std::from_chars_resultcontaining two members: a pointer to the first unparsed character and an error code indicating whether parsing was successful.

In this function is called std::from_chars with input string and length as arguments, and a pointer to an 8-bit unsigned integer to hold the sparse value. The function then checks the return for equality std::from_chars error code std::errc()which indicates the success of the parsing. If the parsing was successful, the function will return true. Otherwise, she will return false.

Can you do better than these features? It is not so obvious. We typically aim to optimize functions that read between 1 and 3 bytes. However: is it possible? Can we speed things up?

Let’s say we can always read 4 bytes even if the string is shorter (ie there is a buffer). This is often a fairly safe assumption. If the numbers are inside a longer string, we can often effectively check that we are within four bytes of the end of the string. Even if it doesn’t, reading four bytes is always safe as long as we don’t cross a page boundary, which is very easy to check.

That is, you can write the input data into a 32-bit word and process all the bytes at once in one register. This is often called SWAR: in computer science, SWAR means “SIMD within a register”; it is a method of performing parallel operations on the data that is in the register of the processor.

Jeroen Koekkoek originally came up with a completely working technique with SWAR, but it was only about 40% faster than the naive solution for unpredictable inputs, and potentially slower than the naive solution for predictable inputs. Let’s explore a solution that can be competitive in any case:

int parse_uint8_fastswar(const char *str, size_t len, 
    uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x30303030lu;
  digits.as_int <<= ((4 - len) * 8);
  uint32_t all_digits = 
    ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) 
       == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24);
  return all_digits 
   & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

This code is also a C function that receives as input arguments a character string, its length, and a pointer to an 8-bit unsigned integer. The function returns a Boolean value that determines whether the input string can be parsed into an 8-bit unsigned integer. We check if the length is in the interval ([1,3]), and if not, then we return false, stopping the execution of the function. After this initial check, the function first initializes the union digits , containing an array of four unsigned 8-bit integers and one 32-bit unsigned integer. The function then copies the input string into a 32-bit unsigned integer using the function memcpy. Functionmemcpy copies the input string into a 32-bit unsigned integer. In production code, when you don’t know the target platform, you may need to display the byte order if the target platform is in big endian format. Big endian systems are extremely rare: they are mainly mainframes. Modern systems compile byte order inversion into single, fast commands. In the code of my post, I assume that you will not encounter a big endian system, which is true 99.99% of the time.

The function then reverses the bits of a 32-bit unsigned integer using an XOR operator and a constant 0x30303030lu. This operation converts each numeric character of the input string to the corresponding decimal value. And indeed: ASCII characters 0 to 9 have ASCII values ​​of code points 0x30 to 0x39. The function then shifts the 32-bit unsigned integer to the left by a number of bits that depends on the length of the input string. This operation removes all trailing bytes that do not belong to the input string. Strictly speaking, when the length is not within the permissible interval ([1,3]), shift can be an undefined behavior, but we still return false to indicate that the result of the calculation is invalid.

The next part is where I contributed to the procedure. First, we use a short expression to check that all characters are numbers. The function then multiplies a 32-bit unsigned integer by a constant 0x640a01, which uses a 32-bit unsigned integer. This is a short way to do two multiplications (by 100 and 10) and two additions at once. Note that 0x64 is 100 and 0x0a is 10. The result of this multiplication is then shifted to the right by 24 bits. This operation retrieves the most significant byte of a 32-bit unsigned integer representing a paired 8-bit unsigned integer. At the end, the function returns the value of the paired 8-bit unsigned integer 8-bit unsigned integer pointed to by num. It then returns a Boolean value indicating whether the steamed integer is less than 256 and whether it consists entirely of digits.

On x64 processors, the function can be compiled into about twenty assembler commands:

lea rcx, [rsi - 4]
xor eax, eax
cmp rcx, -3
jb .END
mov eax, 808464432
xor eax, dword ptr [rdi]
shl esi, 3
neg sil
mov ecx, esi
shl eax, cl
lea ecx, [rax + 101058054]
or ecx, eax
test ecx, -252645136
sete cl
imul esi, eax, 6556161
shr esi, 24
mov byte ptr [rdx], sil
bswap eax
cmp eax, 132358
setb al
and al, cl
movzx eax, al
.END: # %return
ret

Wrote a benchmark to test these functions. The benchmark uses random input values ​​or sequential input (0,1, …) and it proved to be very relevant. I used GCC 12 and a Linux server on Ice Lake (Intel) with 3.2 GHz. The table shows the number of millions of steamed numbers per second.

random numbers

consecutive numbers

std::from_chars

145 million/s

255 million/s

naive

210 million/s

365 million/s

SWAR

425 million/s

425 million/s

That is, with unpredictable input values, the solution with SWAR is twice as fast as the naive solution. In the case of predictable inputs, we outperform the naive solution by about 15%. Whether this will be useful in your particular case depends on your data, so it’s worth doing your own benchmarks.

Importantly, the technique with SWAR is completely equivalent to the naive solution, except that it always reads 4 bytes, regardless of the length.

The results from from_chars are absolutely disappointing. I’m at a loss as to why the naive solution is so much faster than the standard library. It solves a slightly different task, but the difference is still quite large. Perhaps the standard library (GCC 12) still has room for improvement.

Can you do better? The benchmark is posted on github. As part of the benchmark, we also carefully check the correctness of the validation.

Acknowledgments: I am grateful to Jeroin Koekkoek of NLnet Labs for suggesting this task to me The described solution is improved based on a suggestion by Jean-Marc Bourget.

Addition: user Perforated Bob suggested a version that will be faster in some compilers:

int parse_uint8_fastswar_bob(const char *str, size_t len, uint8_t *num) {
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x303030lu;
  digits.as_int <<= (len ^ 3) * 8;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 16);
  return ((((digits.as_int + 0x767676) | digits.as_int) & 0x808080) == 0) 
   && ((len ^ 3) < 3) 
   && __builtin_bswap32(digits.as_int) <= 0x020505ff;
}

Related posts