How I processed one billion rows in PHP

You’ve probably already heard of a competition called The One Billion Row Challenge (1brc); if not, I suggest checking out Gunnar Morling’s 1brc repository.

I was motivated to take part because two of my colleagues had entered and reached leading positions.

PHP is not exactly known for outstanding speed. However, given that I’m working on a PHP profiler, I decided to use this challenge to test its performance.

The first naive approach

I cloned the repository and generated a dataset with a billion records in the measurements.txt text file. After that, I began developing the initial, simplest version of an algorithm that could perform the task:

<?php

$stations = [];
$fp = fopen('measurements.txt', 'r');

while ($data = fgetcsv($fp, null, ';')) {
    if (!isset($stations[$data[0]])) {
        $stations[$data[0]] = [
            $data[1],
            $data[1],
            $data[1],
            1
        ];
    } else {
        $stations[$data[0]][3] ++;
        $stations[$data[0]][2] += $data[1];
        if ($data[1] < $stations[$data[0]][0]) {
            $stations[$data[0]][0] = $data[1];
        }
        if ($data[1] > $stations[$data[0]][1]) {
            $stations[$data[0]][1] = $data[1];
        }
    }
}

ksort($stations);

echo '{';
foreach ($stations as $k => &$station) {
    $station[2] = $station[2] / $station[3];
    echo $k, '=', $station[0], '/', $station[2], '/', $station[1], ', ';
}
echo '}';

There is nothing special about this script: we open the file and read the data using the fgetcsv() function. If a station has not been seen yet, we register it; if it is already in our list, we increment the counter, add the temperature to the running total, and update the minimum and maximum values if necessary.

Once done, I use ksort() to sort the $stations array, then output the list, calculating each station’s average temperature by dividing the sum by the number of entries.

Running this basic code on my laptop showed that it takes 25 minutes 🤯

Now it’s time to start optimizing, with the profiler’s help:

Visualizing the time intervals allowed me to determine that the task is CPU bound. Script compilation at the start has no noticeable impact, and there are no significant garbage collection operations.

Looking at the flame graph is also helpful, as it shows that I’m spending 46% of the CPU time in fgetcsv().

Moving from fgetcsv() to fgets()

My first optimization was to replace fgetcsv() with fgets(), reading each line myself and splitting it on the “;” character. The motivation was that fgetcsv() does more work than my purposes require.

// ...
while ($data = fgets($fp, 999)) {
    $pos = strpos($data, ';');
    $city = substr($data, 0, $pos);
    $temp = substr($data, $pos + 1, -1);
// ...

Additionally, I renamed $data[0] to $city and $data[1] to $temp everywhere.

After running the script again with this change, the execution time dropped to 19 minutes 49 seconds. That is still quite long, but it’s a 21% decrease compared to the previous run.

The flame graph reflects the change, and switching to the per-line view of CPU execution time also reveals what is happening in the main block of code:

I’m spending ~38% of CPU time on lines 18 and 23, which look like this:

18 | $stations[$city][3] ++;
   | // ...
23 | if ($temp > $stations[$city][1]) {

Line 18 is the first access to the $stations array for the current row.

Using references where possible

$station = &$stations[$city];
$station[3] ++;
$station[2] += $temp;

// instead of

$stations[$city][3] ++;
$stations[$city][2] += $temp;

That way, PHP doesn’t have to look up the key in the $stations array on every access; you can think of it as caching a reference to the “current” station in the array.

And this measure turned out to be effective: the execution time dropped to 17 minutes 48 seconds, which is 10% faster than the previous result!

Just one comparison

While going through the code, I came across this snippet:

if ($temp < $station[0]) {
    $station[0] = $temp;
}
if ($temp > $station[1]) {
    $station[1] = $temp;
}

Considering that a temperature below the current minimum cannot simultaneously be above the maximum, I converted the second condition to an elseif, which potentially saves CPU time.
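The comparison block then becomes:

if ($temp < $station[0]) {
    $station[0] = $temp;
} elseif ($temp > $station[1]) {
    $station[1] = $temp;
}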

By the way, I don’t know the order of the temperatures in the measurements.txt file, but depending on it, it may matter which comparison is made first.

With this optimization, the new version of the script runs in 17 minutes and 30 seconds, which is 2% faster than the previous time. It’s a small improvement, but it beats the usual run-to-run fluctuations.

Adding explicit type casts

PHP is known for its dynamic, flexible type system, which I especially appreciated early in my programming journey, as it meant one less thing to worry about. However, precise type information can significantly help the interpreter make optimal decisions during execution.

$temp = (float)substr($data, $pos+1, -1);

And you know what? This minor change reduced the script’s execution time to just 13 minutes 32 seconds, improving performance by as much as 21%!

18 | $station = &$stations[$city];
   | // ...
23 | } elseif ($temp > $station[1]) {

Line 18 still accounts for 11% of the CPU time due to the array access (looking up the key in the hash table that underlies PHP’s associative arrays).

The CPU cost of line 23 decreased from about 32% to 15%. This is because PHP no longer needs to perform type conversions: before the explicit cast, the variables $temp, $station[0], and $station[1] were strings, and PHP had to convert them to floats for every comparison.
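To see what those implicit conversions cost in isolation, here is a hypothetical micro-benchmark (my own illustration, not from the original experiment; the values and iteration count are arbitrary):

<?php

// Illustrative sketch: compare numeric strings vs. floats.
$iterations = 10_000_000;

// Comparing numeric strings forces PHP to convert both operands on every pass.
$a = '12.3';
$b = '45.6';
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $r = $a < $b;
}
printf("string operands: %.1f ms\n", (hrtime(true) - $start) / 1e6);

// With floats the comparison is a plain numeric check, no conversion needed.
$a = 12.3;
$b = 45.6;
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $r = $a < $b;
}
printf("float operands:  %.1f ms\n", (hrtime(true) - $start) / 1e6);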

Let’s take a look at JIT

OPCache is not active by default on the command line; you need to set the opcache.enable_cli parameter to enable it. OPCache includes a JIT which, although enabled by default, does not actually do anything because the JIT buffer size is set to zero, so I set opcache.jit_buffer_size to 10M. After making these changes, I re-ran the script with the JIT enabled, and the execution time went down to:

7 minutes 19 seconds 🚀

That’s a 45.9% reduction in execution time!
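For reference, both settings can be passed directly on the command line instead of php.ini (the script name here is a placeholder):

php -d opcache.enable_cli=1 -d opcache.jit_buffer_size=10M measurements.php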

What else can be improved?

Over the weekend I had already reduced the runtime from 25 minutes to 7 minutes. One thing that surprised me was that fgets() triggered about 56 GB/m of memory allocations to process a 13 GB file. That looked suspicious, so I investigated the fgets() implementation and found that I could reduce the number of allocations by omitting the len argument from the fgets() call:

while ($data = fgets($fp)) {

// instead of

while ($data = fgets($fp, 999)) {

Profile comparison before and after changes:

You might assume that this change would bring a noticeable performance gain, but it was only about 1%. This is because these are small allocations, which the memory manager ZendMM handles very efficiently by placing them in special containers (bins), keeping them very fast.

Is it possible to speed up the process even more?

Of course it is! So far I have used the single-threaded approach that is traditional for most PHP scripts, but PHP supports user-level multi-threading via the parallel extension.

The profiler analysis shows that reading the data is the weak point. Replacing fgetcsv() with fgets() followed by manual line splitting helped, but the process is still time-consuming. So let’s use multiple threads to read and process the data simultaneously, and then merge the intermediate results from all threads.
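If you have never used ext-parallel, here is a minimal sketch of the Runtime/Future pattern that the full script below relies on (it requires a ZTS build of PHP with the parallel extension installed):

<?php

// Each \parallel\Runtime is backed by its own thread.
$runtime = new \parallel\Runtime();

// run() schedules the closure on that thread and returns a Future right away.
$future = $runtime->run(function (int $n): int {
    return $n * 2;
}, [21]);

// value() blocks until the closure finishes, then returns its result.
echo $future->value(), PHP_EOL; // prints 42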

$file = 'measurements.txt';

$threads_cnt = 16;

/**
 * Get the chunks that each thread needs to process with start and end position.
 * These positions are aligned to \n chars because we use `fgets()` to read
 * which itself reads till a \n character.
 *
 * @return array<int, array{0: int, 1: int}>
 */
function get_file_chunks(string $file, int $cpu_count): array {
    $size = filesize($file);

    if ($cpu_count == 1) {
        $chunk_size = $size;
    } else {
        $chunk_size = (int) ($size / $cpu_count);
    }

    $fp = fopen($file, 'rb');

    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);

        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // moves fp to next \n char
            $chunk_end = ftell($fp);
        }

        $chunks[] = [
            $chunk_start,
            $chunk_end
        ];

        $chunk_start = $chunk_end + 1;
    }
    fclose($fp);

    return $chunks;
}

/**
 * This function will open the file passed in `$file` and read and process the
 * data from `$chunk_start` to `$chunk_end`.
 *
 * The returned array has the name of the city as the key and an array as the
 * value, containing the min temp in key 0, the max temp in key 1, the sum of
 * all temperatures in key 2 and count of temperatures in key 3.
 *
 * @return array<string, array{0: float, 1: float, 2: float, 3: int}>
 */
$process_chunk = function (string $file, int $chunk_start, int $chunk_end): array {
    $stations = [];
    $fp = fopen($file, 'rb');
    fseek($fp, $chunk_start);
    while ($data = fgets($fp)) {
        $chunk_start += strlen($data);
        if ($chunk_start > $chunk_end) {
            break;
        }
        $pos2 = strpos($data, ';');
        $city = substr($data, 0, $pos2);
        $temp = (float)substr($data, $pos2 + 1, -1);
        if (isset($stations[$city])) {
            $station = &$stations[$city];
            $station[3] ++;
            $station[2] += $temp;
            if ($temp < $station[0]) {
                $station[0] = $temp;
            } elseif ($temp > $station[1]) {
                $station[1] = $temp;
            }
        } else {
            $stations[$city] = [
                $temp,
                $temp,
                $temp,
                1
            ];
        }
    }
    return $stations;
};

$chunks = get_file_chunks($file, $threads_cnt);

$futures = [];

for ($i = 0; $i < $threads_cnt; $i++) {
    $runtime = new \parallel\Runtime();
    $futures[$i] = $runtime->run(
        $process_chunk,
        [
            $file,
            $chunks[$i][0],
            $chunks[$i][1],
        ]
    );
}

$results = [];

for ($i = 0; $i < $threads_cnt; $i++) {
    // `value()` blocks until a result is available, so the main thread waits
    // for the thread to finish
    $chunk_result = $futures[$i]->value();
    foreach ($chunk_result as $city => $measurement) {
        if (isset($results[$city])) {
            $result = &$results[$city];
            $result[2] += $measurement[2];
            $result[3] += $measurement[3];
            if ($measurement[0] < $result[0]) {
                $result[0] = $measurement[0];
            }
            if ($measurement[1] > $result[1]) {
                $result[1] = $measurement[1];
            }
        } else {
            $results[$city] = $measurement;
        }
    }
}

ksort($results);

echo '{', PHP_EOL;
foreach ($results as $k => &$station) {
    echo "\t", $k, '=', $station[0], '/', ($station[2] / $station[3]), '/', $station[1], ',', PHP_EOL;
}
echo '}', PHP_EOL;

This script does several things: first, I split the file into chunks aligned to line boundaries (\n) so that fgets() can be used later. Once the chunks are defined, I spawn $threads_cnt worker threads. Each thread opens the same file, seeks to the start of its chunk, then reads and processes the data until the end of the chunk, returning its intermediate results. Finally, the main thread merges these results, sorts them, and prints them.

This multi-threaded approach completes in just:

1 minute 35 seconds 🚀

Is this the end?

No way. There are at least two aspects of this solution worth noting:

This code was run on macOS with an Apple Silicon processor, where there are known problems using the JIT in a version of PHP compiled with ZTS support, so the 1 minute 35 seconds result is without JIT. It might be even better with the JIT working.

I realized that I was using a version of PHP compiled with CFLAGS=”-g -O0 …”, which suits my daily debugging needs 🤦.

I should have checked this from the start, so I recompiled PHP 8.3 with CFLAGS=”-Os …”, and ended up with a final runtime, using 16 threads, of:

🚀 27.7 seconds 🚀

This result is quite different from what you might have seen on the leaderboard of the original challenge, which comes down to running on different hardware.

Here’s the timeline when using 10 threads:

The bottom lane in the graph is the main thread, which waits for results from the worker threads. As soon as intermediate results start arriving from the workers, the main thread aggregates and sorts them. The main thread is clearly no longer the bottleneck, so any further optimization should focus on the worker threads.

What did I learn?

Each level of abstraction is a trade-off between convenience and CPU cycles or memory. fgetcsv() is easy to use and hides a lot of detail, but that comes at a cost. Even fgets() hides some details from us, while making reading the data much easier.

Type information in code can help the language optimize execution and eliminate seemingly invisible type conversions that cost CPU time.

JIT is a really powerful tool, especially when it comes to CPU-bound tasks!

It may not be typical for most PHP programs, but thanks to parallelization (using ext-parallel) we were able to speed things up significantly.

Conclusion

I hope you had as much fun reading this article as I had writing it. If you want to dive deeper into code optimization, feel free to experiment yourself and share your results in the comments.
