Just don’t copy it

What I am going to talk about in this article is so trivial that any developer, even a beginner, already knows it – at least I really hope so. And yet the code that comes in for review shows that people have done, and continue to do, something like this:

bool LoadAnimation(std::string filename);
void DrawLines(std::vector<Vector2> path);
Matrix RotateObject(Matrix m, Angle angle);
int DrawSprite(Sprite sprite);

What do these functions have in common? They take their arguments by value. And every time such a function is called, a copy of the input data is created and passed into the function. You can forgive this in code that is called rarely, like loading an animation, or hope that compiler optimizations will kick in and eliminate the copy, but unfortunately, most often this approach destroys only one thing: performance and fps.
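Here is how the same interfaces look with the copies removed (a sketch – the element type of the path is my assumption, since the original declarations are not shown in full):

bool LoadAnimation(const std::string& filename);
void DrawLines(const std::vector<Vector2>& path); // Vector2 is assumed here
Matrix RotateObject(const Matrix& m, Angle angle); // Angle stays by value: it is small
int DrawSprite(const Sprite& sprite);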


Any optimization should be approached only (!) after analysis in the profiler: a copy may turn out not to be an expensive operation at all. It depends, for example, on the size of the object – compilers do an excellent job of passing objects of up to ~32 bytes by value; there are costs, of course, but they are so small they don’t even register in benchmarks. A vendor can also build something into the platform and compiler such that copying 32 KB from a special memory chip is faster than adding a couple of numbers. And let’s be honest: “hot code” is often not the biggest problem of a game’s overall performance. Dynamic memory allocation, however, can bring many surprises, especially when used thoughtlessly.

But even if the overhead is small, does it make sense to waste CPU cycles when it can be avoided? It is exactly these “lost 2–3%” of smeared-out perf, which don’t even light up in the profiler, that are very difficult to catch later, and even more difficult to fix.

Hidden allocation on strings
#include <string>
#include <numeric>

size_t PassStringByValueImpl(std::string str) {
    return std::accumulate(str.begin(), str.end(), size_t{0}, [] (size_t v, char a) { 
        return (v += (a == ' ') ? 1 : 0);
    });
}

size_t PassStringByRefImpl(const std::string& str) {
    return std::accumulate(str.begin(), str.end(), size_t{0}, [] (size_t v, char a) { 
        return (v += (a == ' ') ? 1 : 0);
    });
}

const std::string LONG_STR("a long string that can't use Small String Optimization");

void PassStringByValue(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = PassStringByValueImpl(LONG_STR);
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(PassStringByValue);

void PassStringByRef(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = PassStringByRefImpl(LONG_STR);
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(PassStringByRef);

void PassStringByNone(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = 0;
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(PassStringByNone);

QuickBench: https://quick-bench.com/q/f6sBUE7FdwLdsU47G26yPOnnViY
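If the function only needs to read the characters, C++17 offers a way to sidestep the question entirely: std::string_view is a cheap pointer-plus-length view that binds to a std::string, a string literal, or a substring without copying or allocating. A minimal sketch of the same space counter (my variant, not part of the original benchmark):

#include <string_view>
#include <numeric>

// No copy and no allocation, whatever the argument's origin.
size_t PassStringByViewImpl(std::string_view str) {
    return std::accumulate(str.begin(), str.end(), size_t{0}, [] (size_t v, char a) {
        return v + ((a == ' ') ? 1 : 0);
    });
}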

Hidden allocation on arrays
size_t SumValueImpl(std::vector<unsigned> vect)
{
    size_t sum = 0;
    for(unsigned val: vect) { sum += val; }
    return sum;
}

size_t SumRefImpl(const std::vector<unsigned>& vect)
{
    size_t sum = 0;
    for(unsigned val: vect) { sum += val; }
    return sum;
}

const std::vector<unsigned> vect_in = { 1, 2, 3, 4, 5 };

void PassVectorByValue(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = SumValueImpl(vect_in);
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(PassVectorByValue);

void PassVectorByRef(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = SumRefImpl(vect_in);
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(PassVectorByRef);

void PassVectorByNone(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = 0;
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(PassVectorByNone);

QuickBench: https://quick-bench.com/q/GU68xgT0r97eYaCKxMzm9bXJei4
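The same trick exists for arrays: C++20’s std::span is a non-owning view over contiguous data, so the function stops caring whether the caller holds a std::vector, a std::array, or a plain C array. A sketch, assuming a C++20 compiler (this variant is mine, not from the original benchmark):

#include <span>
#include <vector>

size_t SumSpanImpl(std::span<const unsigned> vect)
{
    size_t sum = 0;
    for(unsigned val: vect) { sum += val; }
    return sum;
}
// Works for SumSpanImpl(vect_in) and for a plain C array alike.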

The compiler is smarter anyway

In cases where the copied object is no more than a couple of tens of bytes, we won’t notice any difference at all from passing it by reference – to the point that the generated assembly code will be “almost the same”.

There is a copy, but it doesn’t affect anything
struct Vector{
    double x;
    double y;
    double z;
};

double DotProductObjectImpl(Vector a, Vector p){
    return (a.x*p.x + a.y*p.y + a.z*p.z);
}

double DotProductRefImpl(const Vector& a, const Vector& p){
    return (a.x*p.x + a.y*p.y + a.z*p.z);
}

void DotProductObject(benchmark::State& state) {
    Vector in = {1,2,3};
    for (auto _ : state) {
        double dp = DotProductObjectImpl(in, in);
        benchmark::DoNotOptimize(dp);
    }
}
BENCHMARK(DotProductObject);

void DotProductRef(benchmark::State& state) {
    Vector in = {1,2,3};
    for (auto _ : state) {
        double dp = DotProductRefImpl(in, in);
        benchmark::DoNotOptimize(dp);
    }
}
BENCHMARK(DotProductRef);

void DotProductNone(benchmark::State& state) {
    for (auto _ : state) {
        size_t n = 0;
        benchmark::DoNotOptimize(n);
    }
}
BENCHMARK(DotProductNone);

QuickBench: https://quick-bench.com/q/drlH-a9o4ejvWP87neq7KAyyA8o

In this example, of course, we know the size of the structure, and the example is very simple. On the other hand, passing by reference is clearly no slower than passing by value, so using const& is the safest default we have. And for primitive types, const& doesn’t affect anything at all when compiling with the /Ox flag.

And since there are no advantages, writing something like const int& i makes no sense – but some still do.
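A quick illustration of why the reference buys nothing for primitives (a sketch of the general rule, not code from the article): an int travels in a register, while a reference is, in effect, an address, which may force the value into memory first.

// Fine: the int fits in a register.
int Scale(int value, int factor) { return value * factor; }

// No faster, and the compiler may have to spill the argument
// to memory just to have something to take the address of.
int ScaleRef(const int& value, const int& factor) { return value * factor; }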

Reserve my vector

Arrays have a huge advantage over other data structures, one that often outweighs any convenience the other containers offer: their elements sit in memory one after another. We could debate at length how the choice of algorithm affects running time, but nothing we have is faster than the processor’s cache line, and the more elements fit in the cache, the faster even the most banal linear search will run. Any access outside the L1 cache immediately increases the runtime.

But when working with vectors (dynamic arrays), many people forget, or never knew, what’s under the hood. And there, if the allocated space has run out – and it was, say, allocated for exactly 1 (one) element – then:

  1. A new, larger memory block is allocated.

  2. All existing elements are copied into the new block.

  3. The old memory block is freed.

All these operations are expensive – fast individually, perhaps, but expensive all the same. And they happen under the hood, invisibly, as long as you:
– don’t go digging through the standard library sources
– don’t look at an allocation profiler
– trust the code written by the vendor (although here you’ll have to accept things as they are)
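The reallocations are easy to observe by printing capacity() as the vector grows – the growth factor is implementation-defined, commonly 1.5x or 2x (a small self-contained sketch, not from the original article):

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    size_t last_cap = 0;
    for (int i = 0; i < 100; ++i) {
        v.push_back(i);
        if (v.capacity() != last_cap) { // a reallocation just happened
            last_cap = v.capacity();
            std::printf("size=%zu capacity=%zu\n", v.size(), last_cap);
        }
    }
}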

Use reserve, Luke
static void NoReserve(benchmark::State& state) {
  for (auto _ : state) {
    // create a vector and add 100 elements
    std::vector<size_t> v;
    for(size_t i = 0; i < 100; i++) { v.push_back(i); }
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(NoReserve);

static void WithReserve(benchmark::State& state) {
  for (auto _ : state) {
    // reserve space for 100 elements up front, then add them
    std::vector<size_t> v;
    v.reserve(100);
    for(size_t i = 0; i < 100; i++) { v.push_back(i); }
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(WithReserve);

static void CycleNone(benchmark::State& state) {
  std::vector<size_t> v;
  for (auto _ : state) {
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(CycleNone);

QuickBench: https://quick-bench.com/q/OuiFp3VOZKNKaAZgM_0DkJxRock

And finally, an example of how you can kill perf and get a drop to 10 FPS out of nowhere, just by moving the mouse across the playing field. I won’t name the engine – the bug has already been fixed. If you find the error, write in the comments 🙂

bool findPath(Vector2 start, Vector2 finish) {
   ...

   while (toVisit.empty() == false) 
   {
      ...

      if (result == OBSTACLE_OBJECT_IN_WAY)
      {
         ...

         const std::vector<Vector2> directions{ {1.f, 0.f}, {-1.f, 0.f}, {0.f, 1.f}, {0.f, -1.f} };
         for (const auto& dir : directions)
         {
            auto nextPos = currentPosition + dir;
            if (visited.find(nextPos) == visited.end())
            {
               toVisit.push({ nextPos, Vector2::DistanceSq(center, nextPos) });
            }
         }
      }
   }
}
