A ThreadPool spell for dotnet daemons on Linux
Everyone has heard that dotnet on Linux sometimes consumes more resources than on Windows. Sometimes the difference is almost imperceptible. But it also happens that the same application consumes 2-3 times more CPU on Linux than on Windows.
An artistic digression, a joke on the subject. You do not have to read it to get to the interesting part.
In the alien environment of the Immaterium, almost everything behaves strangely. Many laws of nature do not work. More precisely, they do not work as everyone is used to. Even time behaves a little differently.
Magos Technicus G was immersed in studying how to make familiar algorithms and mechanisms work in the aggressive environment of the Warp in a way at least acceptable to his usual understanding.
Only high tech-priests have access to the ancient perf tablets with arcane and forbidden spells. Only those hardened by centuries of intellectual effort for the glory of the Omnissiah are sufficiently prepared to use them. After all, to understand the spells you need to dive into the depths of your mind, touch the Warp, and keep it from devouring you.
After a week of deep meditation, Magos G had an idea: the spin limit of the unfair semaphore must be changed.
There are many reasons why dotnet applications consume different amounts of CPU under different OSes. Very briefly, many primitives, and even large chunks of logic, are implemented differently. And in each specific case a different one of them can come into play.
Some of these differences (or even “problems”) will eventually be ironed out or improved, where possible, by dotnet itself. For some of them, certain versions of dotnet expose experimental settings that you can manage yourself. Sometimes these settings migrate to new versions, enabled by default.
Not surprisingly, most of the performance degradation of dotnet on Linux is around asynchronous work, around the threadpool. At some point, the dotnet developers even rewrote the threadpool from native code to managed C# code, so that it would at least try to behave similarly under different OSes. But the basic primitives for asynchronous work still differ a lot between OSes; even the set of asynchronous methods in the operating system APIs is different. Not every async method is genuinely asynchronous on every OS.
Let’s describe the situation
There is a type of program that “does nothing” most of the time. It waits for work and has to start on it as quickly as possible once the work appears. There is no timetable known in advance for when such work arrives. And the work is bursty: when it appears, there is a lot of it at once, enough for many threads.
Why are they of interest to us? Because there are many of them. For example, our company has about 600 such daemons in a test environment in one test system. We will deal with them further.
We are also interested in them because their total resource consumption slightly more than doubled after the move from Windows to Linux.
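The kind of daemon described in this section can be sketched roughly as follows. This is only an illustration: the class name, the `HandleItem` method, the burst sizes, and the idle interval are all made up here and are not taken from our real services.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical daemon: idle most of the time, then a burst of short work items.
public static class BurstyDaemon
{
    // Stand-in for real work; purely illustrative.
    private static int HandleItem(int item) => item * 2;

    public static async Task<int> RunAsync(int bursts, int itemsPerBurst, TimeSpan idle)
    {
        int handled = 0;
        for (int b = 0; b < bursts; b++)
        {
            // "Doing nothing": wait until the next batch of work shows up.
            await Task.Delay(idle);

            // The burst: many short work items queued to the thread pool at once.
            Task<int>[] burst = Enumerable.Range(0, itemsPerBurst)
                .Select(i => Task.Run(() => HandleItem(i)))
                .ToArray();

            await Task.WhenAll(burst);
            handled += burst.Length;
        }
        return handled;
    }

    public static async Task Main()
    {
        int handled = await RunAsync(
            bursts: 5, itemsPerBurst: 200, idle: TimeSpan.FromMilliseconds(500));
        Console.WriteLine($"Handled {handled} work items");
    }
}
```

Between bursts the threadpool workers have nothing to do, and that idle-then-wake-up cycle is where the rest of this story takes place.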
What was the CPU used for?
Unfortunately, none of the widely available, popular ways of diagnosing CPU consumption worked in this case. All the tools showed “all CPU is spent somewhere in the threadpool”. At best they narrowed it down to PortableThreadPool.WorkerThread.WorkerThreadStart(). This was not enough.
The perf tool came to the rescue. Guides on how to use it are easy to find. And with some difficulty, you will be able to analyze even a complex application with it.
We will not go into the details of studying the collected artifacts. But everything pointed to the CPU being wasted in SpinWaits inside the semaphore.
What is SpinWait?
This question is perfectly answered by the documentation from the Microsoft website:
SpinWait is a lightweight synchronization type that you can use in low-level scenarios to avoid the expensive context switches and kernel transitions that are required for kernel events.
How does SpinWait work, in simple terms? It just burns a few CPU cycles on some useless work, which takes roughly a few tens of nanoseconds.
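To get a feel for how cheap a single spin is on your own hardware, a rough measurement like the one below can help. It is a minimal sketch, not a proper benchmark: it uses `Thread.SpinWait`, the raw busy-wait call, rather than the `SpinWait` struct, and the iteration counts are arbitrary values picked for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class SpinCost
{
    public static void Main()
    {
        const int rounds = 1_000_000;      // arbitrary
        const int iterationsPerSpin = 10;  // arbitrary

        // Warm up so that JIT compilation is not part of the measurement.
        Thread.SpinWait(iterationsPerSpin);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < rounds; i++)
        {
            // Busy-waits on the CPU; the thread does not block in the kernel here.
            Thread.SpinWait(iterationsPerSpin);
        }
        sw.Stop();

        double nsPerCall = sw.Elapsed.TotalMilliseconds * 1_000_000 / rounds;
        Console.WriteLine(
            $"Thread.SpinWait({iterationsPerSpin}): ~{nsPerCall:F1} ns per call");
    }
}
```

The exact numbers depend on the CPU and the runtime version, so treat the output as an order of magnitude only.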
It can be assumed that the authors of the threadpool consider the work done in the WorkerThreadStart method while holding the semaphore to be very short. If the semaphore is currently held by someone else, then perhaps it is enough to let just a few processor cycles pass, and the semaphore will be free again. That should be much cheaper and faster than falling into a real Wait, because a real Wait yields the thread: it does a context switch and hands the thread back to the thread scheduler. And this is a very expensive and long operation, usually much more expensive than a couple of wasted CPU cycles.
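The idea of “spin a little first, then really wait” can be sketched like this. This is not the runtime's internal semaphore. `SpinThenBlock`, `Acquire`, and the backoff formula are hypothetical names and choices made for illustration, built on the public `SemaphoreSlim` and `Thread.SpinWait` APIs.

```csharp
using System;
using System.Threading;

// Hypothetical helper; NOT the semaphore used inside the threadpool.
public static class SpinThenBlock
{
    // Returns true if the semaphore was acquired while spinning,
    // false if the caller had to fall back to a real wait.
    public static bool Acquire(SemaphoreSlim semaphore, int spinLimit)
    {
        for (int i = 0; i < spinLimit; i++)
        {
            // Wait(0) is a non-blocking attempt: it returns immediately.
            if (semaphore.Wait(0))
            {
                return true;
            }

            // Burn a few cycles before trying again (capped backoff).
            Thread.SpinWait(1 << Math.Min(i, 6));
        }

        // Spinning did not help: do a real wait, which may block the thread.
        semaphore.Wait();
        return false;
    }

    public static void Main()
    {
        using var semaphore = new SemaphoreSlim(initialCount: 1);
        bool spun = Acquire(semaphore, spinLimit: 70);
        Console.WriteLine(spun ? "acquired while spinning" : "acquired after waiting");
        semaphore.Release();
    }
}
```

With `spinLimit: 0` the loop body never runs and every acquisition goes straight to the real wait, which is the trade-off the rest of this article is about.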
Why is it more expensive on Linux?
The devil only knows. It just works differently, that’s all. Not worse, not better, just differently. And the way dotnet uses such primitives needs to be configured differently depending on the OS.
What shall we do?
Judging by the code, the semaphore is limited to 70 SpinWait iterations by default. And, surprisingly, this value can be configured through an environment variable!
What will happen if this number is reduced? For example, if we write 0 there?
We set the environment variable DOTNET_ThreadPool_UnfairSemaphoreSpinLimit=0 on all 600+ instances. Then we look at the graphs of total CPU consumption.
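If you do not have such graphs at hand, a single process can report the same thing about itself. The sketch below is an assumption-laden illustration: it prints the value of the variable the process was started with and samples its own CPU time over a window. The class name and the 10-second window are arbitrary, and the `Thread.Sleep` stands in for the daemon's normal workload.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class CpuSample
{
    public static void Main()
    {
        // The variable name comes from the article. Reading it here only reports
        // what the process was launched with; it does not change runtime behavior.
        var spinLimit = Environment.GetEnvironmentVariable(
            "DOTNET_ThreadPool_UnfairSemaphoreSpinLimit");
        Console.WriteLine($"UnfairSemaphoreSpinLimit: {spinLimit ?? "(not set)"}");

        using var process = Process.GetCurrentProcess();
        TimeSpan window = TimeSpan.FromSeconds(10); // arbitrary

        TimeSpan cpuBefore = process.TotalProcessorTime;
        var wall = Stopwatch.StartNew();

        // In a real daemon the normal workload would be running meanwhile.
        Thread.Sleep(window);

        wall.Stop();
        process.Refresh(); // drop cached process info before reading again
        TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;

        double cores = cpuUsed.TotalMilliseconds / wall.Elapsed.TotalMilliseconds;
        Console.WriteLine(
            $"CPU time: {cpuUsed.TotalMilliseconds:F0} ms over " +
            $"{wall.Elapsed.TotalSeconds:F1} s (~{cores:F2} cores)");
    }
}
```

Comparing this number for the same workload with and without the variable gives a per-process version of the before-and-after picture.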
Time to rejoice?
Success? Do we rush to set this environment variable for every application on Linux? By no means.
Let’s theorize about what could go wrong
What harm can come from never doing SpinWait anymore and always falling into an honest Wait? It can cause a sharp drop in the throughput of the threadpool.
It is easy to imagine an application with a regular, stable stream of Tasks that are created and executed very quickly. The threadpool, in the WorkerThreadStart method, often runs into a busy semaphore, waits a little in SpinWait, gets the semaphore, picks up a work item, and goes off to execute it. The ratio of useless work (SpinWait) to useful work is minimal. “Idle” time (spent not performing useful work) is minimal.
And if SpinWaits are not available (or there are too few of them to outlast the wait for the semaphore), we can, in theory, often fall into an honest Wait and make a context switch. This takes a lot of time, and the ratio of “idle” time (spent not doing useful work) to time spent doing useful work grows. The shorter the Tasks and the more of them there are, the worse this ratio becomes.
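One way to check whether this theory applies to you is a small throughput test with a steady stream of very short work items, run once with the default settings and once with a changed spin limit. The sketch below is only an assumption about what such a test could look like: the item count, the gap between items, and the class name are arbitrary, and it says nothing about how your real workload behaves.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class ShortTaskThroughput
{
    private static long s_completed;

    public static void Main()
    {
        const int total = 200_000; // arbitrary
        using var done = new CountdownEvent(total);
        using var process = Process.GetCurrentProcess();

        TimeSpan cpuBefore = process.TotalProcessorTime;
        var wall = Stopwatch.StartNew();

        // Main is not a pool thread, so the items are queued from outside the pool.
        for (int i = 0; i < total; i++)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                // A deliberately tiny work item.
                Interlocked.Increment(ref s_completed);
                done.Signal();
            });

            // Small gap so that workers regularly run out of work and must wait.
            Thread.SpinWait(200);
        }

        done.Wait();
        wall.Stop();
        process.Refresh();
        TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;

        Console.WriteLine($"completed: {Interlocked.Read(ref s_completed)}");
        Console.WriteLine($"items/s:   {total / wall.Elapsed.TotalSeconds:F0}");
        Console.WriteLine($"CPU time:  {cpuUsed.TotalMilliseconds:F0} ms");
    }
}
```

Look at both numbers together: a lower CPU time is only a win if the items per second did not drop by more than you can afford.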
Therefore, mindlessly touching the DOTNET_ThreadPool_UnfairSemaphoreSpinLimit variable is strongly discouraged. First, assess all the risks, carefully study how the change will affect your program, and carefully observe it over time.
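For the “observe over time” part, the threadpool exposes a few counters that are cheap to log. The loop below is a minimal sketch that prints them once a second; the class name, the interval, and the ten-tick limit are arbitrary, and a real daemon would more likely push these values to its metrics system.

```csharp
using System;
using System.Threading;

public static class ThreadPoolWatch
{
    public static void Main()
    {
        long lastCompleted = ThreadPool.CompletedWorkItemCount;

        // Ten samples for the demo; a real daemon would loop until shutdown.
        for (int tick = 0; tick < 10; tick++)
        {
            Thread.Sleep(1000);

            long completed = ThreadPool.CompletedWorkItemCount;
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount} " +
                $"pending={ThreadPool.PendingWorkItemCount} " +
                $"completed/s={completed - lastCompleted}");
            lastCompleted = completed;
        }
    }
}
```

A growing pending count or a falling completion rate after changing the spin limit is a hint that the change hurts this particular application.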
Conclusions
- ThreadPool is an incredibly complex abstraction that allows you to write “multi-threaded” code easily and casually. But sometimes it comes at a huge price.
- Even the developers of ThreadPool cannot write it so that it is perfect in all corner cases. In some special situations, it works “imperfectly”.
- If your program began to consume significantly more CPU after switching from Windows to Linux, you can experiment with the environment variable DOTNET_ThreadPool_UnfairSemaphoreSpinLimit, trying values from 0 up to whatever you like and using the defaults in the ThreadPool code as a reference.
- However, this will not help every application. For many, it will probably even hurt. After all, such defaults are not chosen for nothing: they must be good “on average”.
- This setting and its environment variable are not the only ones available to us today.
- In each subsequent version of dotnet, things may change completely, and the environment variable may no longer be used. Read the changelog.