Good binary search without branching

Short description

A blog post by Alex Muscar describes a binary search algorithm called “Shar’s algorithm”. The algorithm is branchless, and here it is translated into a version that works on C++ iterators. In GCC it is more than twice as fast as std::lower_bound, and noticeably faster than Clang’s std::lower_bound on smaller arrays. However, for strings and other comparison functions that contain loops of their own, std::lower_bound may still be preferable. The algorithm appeared in a 1982 book by John Bentley and is attributed, via Knuth, to Leonard Shar.


I recently read a post by Alex Muscar, Beautiful Binary Search in D. It describes a binary search algorithm called “Shar’s algorithm”. I had never heard of it, and it is impossible to google, but when I saw the algorithm I could only think one thing: it is branchless. Who knew binary search could be done without branching? So I set about translating it into an algorithm for C++ iterators that does not require one-based indexing or fixed-size arrays.

In GCC it is more than twice as fast as std::lower_bound, which is itself a very high-quality binary search. The search loop is simple, and the generated assembly is beautiful. I am amazed that this exists and yet nobody seems to be using it.


Let’s start with the code:

template<typename It, typename T, typename Cmp>
It branchless_lower_bound(It begin, It end, const T & value, Cmp && compare)
{
    size_t length = end - begin;
    if (length == 0)
        return end;
    size_t step = bit_floor(length);
    if (step != length && compare(begin[step], value))
    {
        length -= step + 1;
        if (length == 0)
            return end;
        step = bit_ceil(length);
        begin = end - step;
    }
    for (step /= 2; step != 0; step /= 2)
    {
        if (compare(begin[step], value))
            begin += step;
    }
    return begin + compare(*begin, value);
}
template<typename It, typename T>
It branchless_lower_bound(It begin, It end, const T & value)
{
    return branchless_lower_bound(begin, end, value, std::less<>{});
}

I said the search loop is simple, but unfortunately lines 4-15 are not so simple. Let’s skip them for now. The main part of the work is done in lines 16-20.
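As a quick sanity check, here is a minimal usage sketch. It assumes the branchless_lower_bound definition above is pasted in, and that the unqualified bit_floor/bit_ceil calls resolve to C++20’s std::bit_floor/std::bit_ceil through using-declarations (or to hand-written helpers such as the ones sketched further down).

#include <algorithm>
#include <bit>       // std::bit_floor, std::bit_ceil (C++20)
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

using std::bit_floor;   // let the unqualified calls inside
using std::bit_ceil;    // branchless_lower_bound find the C++20 functions

// ... paste branchless_lower_bound from above here ...

int main()
{
    std::vector<int> v = {1, 3, 3, 5, 8, 13, 21};
    auto mine = branchless_lower_bound(v.begin(), v.end(), 5);
    auto with_std = std::lower_bound(v.begin(), v.end(), 5);
    assert(mine == with_std);
    std::printf("lower bound of 5 is at index %td\n", mine - v.begin());
}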

Without branching

The loop might not look branchless, since there is obviously a loop condition and an if statement inside the loop body. Let me explain why it still counts:

  • The if statement gets compiled into a CMOV (conditional move) instruction, meaning there is no branch. At least that is what GCC does. I could not get Clang to make this code branch-free, no matter what tricks I tried (one such rewrite is sketched after this list). In the end I gave up on that, since it works in GCC. I wish C++ simply allowed using CMOV directly.
  • The loop condition is a branch, but it depends only on the length of the array, so it is predicted very well and we do not need to worry about it. In Alex’s post the loop is fully unrolled, which removes even this branch, but in my benchmarks the unrolled version was slower because the function body became too big to inline. So I kept the loop.
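For illustration, this is the kind of rewrite one typically tries in order to coax a compiler into branch-free code: replace the if with arithmetic on the boolean result of the comparison. It is only a sketch of what such a “trick” looks like; as said above, none of these variations reliably removed the branch in Clang.

#include <cstddef>

// Variant of the inner loop that multiplies the step by the comparison
// result instead of branching on it. Illustrative only; per the text above,
// this did not reliably produce branch-free code in Clang.
template<typename It, typename T, typename Cmp>
It branchless_loop_variant(It begin, size_t step, const T & value, Cmp && compare)
{
    for (step /= 2; step != 0; step /= 2)
        begin += step * static_cast<size_t>(compare(begin[step], value));
    return begin;
}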

Algorithm

So, having clarified that the title means one branch is gone and the second is practically free (and could be removed if we wanted), let me explain how it all works.

The important variable here is step from line 7. We take jumps whose sizes are powers of two: if the array is 64 elements long, step will take the values 64, 32, 16, 8, 4, 2, 1. It is initialized to the largest power of two that does not exceed the length of the input. So if the input is 22 elements long, the initial value is 16. My compiler does not yet have the new std::bit_floor function, so I wrote my own helper that rounds down to a power of two. Once C++20 support is more widespread, it should be replaced with a call to std::bit_floor.
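The post does not show the hand-written helpers, so the following is only one possible sketch of bit_floor and bit_ceil with the semantics the code above relies on; on a C++20 compiler you would use std::bit_floor and std::bit_ceil from <bit> instead.

#include <cstddef>

// Round down to the largest power of two that is <= i (bit_floor(0) == 0).
inline size_t bit_floor(size_t i)
{
    if (i == 0)
        return 0;
    size_t result = 1;
    while (i >>= 1)
        result <<= 1;
    return result;
}

// Round up to the smallest power of two that is >= i (bit_ceil(0) == 1).
inline size_t bit_ceil(size_t i)
{
    size_t result = 1;
    while (result < i)
        result <<= 1;
    return result;
}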

We will always take steps whose sizes are powers of two, which becomes a problem when the length of the input is not a power of two. So in lines 8-15 we check whether the element at index step is less than the value we are searching for. If it is, the search continues among the last elements. For example: if the input is 22 elements long and that comparison is false, we search the first 16 elements, indices 0 to 15. If it is true, we search the last eight elements, indices 14 to 21.

input          0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
line 8 compare                                       16
when false     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
when true                                      14 15 16 17 18 19 20 21

Yes, this means that indices 14, 15, and 16 are included in the second half even though the comparison in line 8 already excluded them, but that is the price of a simple loop: we have to round up to a power of two.
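Working through lines 4-15 on this 22-element example (a trace derived from the code above, with the comparison on line 8 assumed to be true):

length = 22
step   = bit_floor(22) = 16
begin[16] < value is true, so:
    length = 22 - (16 + 1) = 5
    step   = bit_ceil(5)   = 8
    begin  = end - 8            (begin now points at index 14)
the loop on lines 16-20 then uses steps 4, 2, 1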

Performance

What is the performance of this code? In GCC, it’s incredibly fast:

At around 16,000 elements in the array it is up to three times faster than std::lower_bound. Beyond that, cache effects start to dominate, so avoiding branch mispredictions matters less.
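For readers who want to check this themselves, here is a rough timing sketch. It is not the author’s benchmark: the container size, key distribution, and harness are my own choices, and branchless_lower_bound from above (with its bit_floor/bit_ceil helpers) is assumed to be in scope.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// branchless_lower_bound and its helpers from above are assumed to be defined here.

// Time many lookups of random keys in a sorted vector and report ns per lookup.
template<typename Search>
double ns_per_lookup(const std::vector<int> & sorted,
                     const std::vector<int> & keys, Search search)
{
    auto start = std::chrono::steady_clock::now();
    size_t checksum = 0;
    for (int key : keys)
        checksum += search(sorted.begin(), sorted.end(), key) - sorted.begin();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    std::printf("(checksum %zu) ", checksum);   // keep the loop from being optimized away
    return elapsed.count() / keys.size();
}

int main()
{
    std::mt19937 rng(0);
    std::vector<int> sorted(16384);
    for (int & v : sorted) v = static_cast<int>(rng() % 1000000);
    std::sort(sorted.begin(), sorted.end());
    std::vector<int> keys(1 << 20);
    for (int & k : keys) k = static_cast<int>(rng() % 1000000);

    std::printf("std::lower_bound:       %.2f ns/lookup\n", ns_per_lookup(sorted, keys,
        [](auto b, auto e, int k) { return std::lower_bound(b, e, k); }));
    std::printf("branchless_lower_bound: %.2f ns/lookup\n", ns_per_lookup(sorted, keys,
        [](auto b, auto e, int k) { return branchless_lower_bound(b, e, k); }));
}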

The spikes for std::lower_bound occur at powers of two, which for some reason are much slower. I looked into it a little but could not find a simple explanation. The Clang version has the same spikes, even though it compiles the algorithm to very different assembly.

branchless_lower_bound is slower than std::lower_bound when compiled with Clang because there I did not manage to get rid of the branch:

The funny thing is that Clang compiles std::lower_bound into branchless code. So std::lower_bound is faster in Clang than in GCC, while my branchless_lower_bound ends up branching. Not only does the red line go up, the blue line also comes down.

But this means that if we compare Clang’s std::lower_bound version with GCC’s branchless_lower_bound version, we will be comparing two branchless algorithms. Let’s do it:

The branchless branchless_lower_bound is faster than the branchless std::lower_bound. On the left-hand side of the graph, where the arrays are smaller, it is on average about one and a half times faster. Why? Mainly because of a tighter inner loop. Here is the assembly code of the two inner loops:

These are cheap instructions with little instruction-level parallelism (each loop iteration depends on the previous one, so instructions per clock are low in both cases), so we can estimate the cost simply by counting instructions. The difference between 13 and 5 is significant. Two differences matter most:

  1. branchless_lower_bound only has to keep track of one pointer instead of two
  2. std::lower_bound has to recompute the size after every iteration; in branchless_lower_bound the step size of the next iteration does not depend on the result of the previous one (see the sketch after this list)
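To make those two points concrete, here is a simplified C++ sketch of the two loop shapes. It is written for this comparison only and is not the actual code of libstdc++’s std::lower_bound or the assembly discussed above.

#include <cstddef>

// lower_bound-style loop: tracks a position and a remaining length, and the
// new length depends on which way the comparison went.
template<typename It, typename T>
It classic_style_loop(It begin, size_t length, const T & value)
{
    while (length > 0)
    {
        size_t half = length / 2;
        if (begin[half] < value)
        {
            begin += half + 1;
            length -= half + 1;   // depends on the comparison result
        }
        else
            length = half;        // also depends on the comparison result
    }
    return begin;
}

// branchless-style loop: one pointer, and the next step is always step / 2,
// independent of the comparison result.
template<typename It, typename T>
It branchless_style_loop(It begin, size_t step, const T & value)
{
    for (; step != 0; step /= 2)
        if (begin[step] < value)  // GCC turns this into a conditional move
            begin += step;
    return begin;
}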

That is great, but the comparison function is supplied by the user, and if it is more expensive, say it has loops of its own, those loops can be much bigger than our search loop. In that case branchless_lower_bound will be slower than std::lower_bound. Here is a binary search over strings, which gets more expensive as the container grows:

More comparisons

Why is it slower for strings? Because it performs more comparisons than std::lower_bound. Stepping by powers of two is not ideal for that. For example, if the input is the array [0, 1, 2, 3, 4] and we search for the middle element, 2, the function behaves quite badly.
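Tracing the code above on this input gives the following comparison sequence:

length = 5, step = bit_floor(5) = 4
comparison 1 (line 8):   begin[4] = 4 < 2?  no
comparison 2 (step = 2): begin[2] = 2 < 2?  no
comparison 3 (step = 1): begin[1] = 1 < 2?  yes, begin moves to index 1
comparison 4 (line 21):  *begin   = 1 < 2?  yes, return index 2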

That is, we perform four comparisons where std::lower_bound only needs two. I chose a particularly awkward example: even though we are looking for the middle element, the first comparison is against the last element, and the final two comparisons test the same index twice. It looks like this could be improved, but every time I tried, the algorithm only got slower.

But it will not be much worse than a perfect binary search. For an array with fewer than 2^n elements:

– a perfect binary search uses n or fewer comparisons

– branchless_lower_bound uses n + 1 or fewer comparisons.

For example, with 1,000 elements (fewer than 2^10 = 1,024) that is at most 10 comparisons versus at most 11.

All in all, it is worth it: we do more iterations, but each one is so cheap that the overall result is much faster. Just keep in mind that when the comparison function itself is expensive, std::lower_bound may be the better choice.

Searching for the source

At the beginning of the article I said that Shar’s algorithm is impossible to google. Alex Muscar wrote that he read about it in a 1982 book by John Bentley. Fortunately, that book can be borrowed from the Internet Archive. Bentley gives the source code and says the idea comes from Knuth’s Sorting and Searching. Knuth does not give source code; he only sketches the idea and says he got it from Leonard Shar in 1971. I do not know where Shar wrote his idea down. Maybe he simply told Knuth about it.

This is the second time I have found an excellent algorithm in Knuth’s books that deserves to be used more widely but has somehow been forgotten. Maybe I should read them more carefully to see which ideas are good and which are not. For example, right after presenting Shar’s algorithm, Knuth spends much more time on a binary search based on the Fibonacci sequence. It is faster if you cannot divide integers by two quickly and can only add and subtract, so today it is probably useless, but who knows? The problem with reading Knuth is that most of the algorithms are not useful, and it is tempting to assume that everything good has already been found by someone. Fortunately for people like me, hidden treasures can still be discovered.

Code

The code is available here. It is released under the Boost Software License.
