Contents

- 1 What, Where, When
- 2 Reverse engineering
- 3 How many keys do we need to check?
- 4 Speed
- 4.1 Naive Python implementation (500 keys per second)
- 4.2 Time works wonders
- 4.3 CUDA First Steps (19,166 keys per second)
- 4.4 Improved sha256 (50,000 keys per second)
- 4.5 Parallel algorithm (105,000 keys per second)
- 4.6 And finally AES (818,000 keys per second)
- 4.7 …and now in parallel (9.8 million keys per second)
- 5 Where did it all go wrong?

### Short description

For the past two years, researchers have been working on a proof-of-concept decryptor for the Phobos ransomware family, but due to the complexity of its key layout it is impractical to use. They nevertheless decided to publish their results and tools in the hope that someone else will find them useful. The key is composed of several sources of entropy, including QueryPerformanceCounter, GetLocalTime, GetTickCount, GetCurrentProcessId, and GetCurrentThreadId. Brute-forcing the entire key would require 2951479051793528258560000000000000000000000 SHA-256 executions. To reduce the complexity, the researchers made assumptions about the infection time and the PID/TID of the ransomware process. This brought the search down to 51 bits of entropy, which still requires a fast implementation: a naive Python PoC tested 500 keys per second and would need 2,314 processor-days to sort through the remaining keys.

## How we almost broke the Phobos cipher using CUDA

For the past two years, we have been working on a proof-of-concept decryptor for the Phobos ransomware family. For reasons we’ll explain here, it’s impractical. So far we have not been able to use it to help an actual victim. But we decided to publish the results and tools in the hope that someone will find them useful, interesting, or continue their research.

We will describe the vulnerability, how we reduced the computational complexity, and how we increased the decryptor's performance to achieve a near-practical implementation.

## What, Where, When

Phobos is the innermost and larger of Mars's two natural satellites. It is also the name of one of the most common ransomware families, first spotted in early 2019. The program itself is not very interesting: it has a lot in common with Dharma and was probably written by the same authors.

We started the research after several important Polish organizations were encrypted by Phobos within a short period. It quickly became clear that the Phobos key layout is unusual; you could even say it is broken. This prompted further research in the hope of creating a decryptor.

## Reverse engineering

Let’s skip the boring details: anti-debugging, shadow copy removal, the disk-traversal routine, etc. The most interesting feature for us is the key layout:

Interestingly, instead of one good source of entropy, the malware author decided to use several bad ones. The list includes:

- Two calls to QueryPerformanceCounter().
- GetLocalTime() + SystemTimeToFileTime().
- GetTickCount().
- GetCurrentProcessId().
- GetCurrentThreadId().

Finally, a variable but deterministic number of SHA-256 rounds is applied. On average, key verification requires 256 SHA-256 executions and one AES decryption.
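The scheme can be sketched as follows. This is a hypothetical illustration of the shape of the derivation, not the exact Phobos code; the seed stands in for the mixed entropy sources listed above:

```python
import hashlib

def derive_key(seed: bytes, rounds: int) -> bytes:
    # Hypothetical sketch: `seed` mixes the entropy sources above, and a
    # variable but deterministic number of SHA-256 rounds is then applied
    # to produce the final AES key.
    digest = seed
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest

# Verifying one candidate costs `rounds` hashes plus one AES decryption,
# i.e. on average roughly 256 SHA-256 executions per key tested.
key = derive_key(b"\x00" * 32, 256)
```

This structure is what makes the brute force expensive: each candidate requires a whole chain of hashes before it can even be tried against the ciphertext.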

This immediately rang alarm bells in our heads. Assuming that the infection time is known with an accuracy of 1 second (for example, through timestamps of files or logs), it turns out that the number of operations required to iterate through each component is:

| Operation | Number of operations |
| --- | --- |
| QueryPerformanceCounter 1 | 10**7 |
| QueryPerformanceCounter 2 | 10**7 |
| GetTickCount | 10**3 |
| SystemTimeToFileTime | 10**5 |
| GetCurrentProcessId | 2**30 |
| GetCurrentThreadId | 2**30 |

Obviously, each component can be brute-forced independently: 10**7 is not much for a modern computer. But is it possible to recover the entire key?

## How many keys do we need to check?

### Initial state (141 bits of entropy)

The number of keys a naive exhaustive search would have to test is:

```
>>> 10**7 * 10**7 * 10**3 * 10**5 * 2**30 * 2**30 * 256
2951479051793528258560000000000000000000000
>>> log(10**7 * 10**7 * 10**3 * 10**5 * 2**30 * 2**30 * 256, 2)
141.08241808752197 # that's a 141-bit number
```

A large, unfathomably large number. No chance for brute force.

But we don't give up that easily.

### Can I have your PID? (81 bits of entropy)

Let’s start with some assumptions. One we’ve already made: thanks to system or file logs, we know the infection time to the nearest second. Windows event logs are one possible source of such data.

There is no default event that fires for every new process, but with enough analysis it is often possible to recover the PID and TID of the process in question.

Even when they can’t be pinned down exactly, Windows allocates PIDs sequentially, so it’s almost always possible to narrow the range considerably. That way, we usually don’t have to iterate over the entire 2**30 space.
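As a rough sketch of the narrowing (the PID bounds here are made up for illustration): Windows allocates PIDs as multiples of 4, so if logged events bracket the ransomware process between two known PIDs, only a handful of candidates remain.

```python
def candidate_pids(before: int, after: int):
    # Hypothetical example: Windows PIDs are multiples of 4, so if processes
    # logged just before and just after the infection have PIDs `before` and
    # `after`, only the multiples of 4 strictly between them need testing.
    return [p for p in range(before + 4, after, 4)]

# Made-up bounds: 23 candidate PIDs instead of 2**30.
pids = candidate_pids(4512, 4608)
```

The same idea applies to the TID of the main thread.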

Assume that we already know the PID and TID (of the main thread) of the ransomware process. Don’t worry: this is the biggest leap of faith in this entire article. Does it make our situation easier?

```
>>> log(10**7 * 10**7 * 10**3 * 10**5 * 256, 2)
81.08241808752197
```

81 bits of entropy is still far too much to brute force, but we’re getting there.

### Δt = t1 – t0 (67 bits of entropy)

Here’s another reasonable assumption: two consecutive calls to QueryPerformanceCounter will return similar results. In particular, the second value will always be slightly larger than the first. There is no need to loop over both counters in full: we can loop over the first one and then guess the delta between the two calls.

If we take the code as an example, instead of

```
for qpc1 in range(10**7):
    for qpc2 in range(10**7):
        check(qpc1, qpc2)
```

you can write:

```
for qpc1 in range(10**7):
    for qpc_delta in range(10**3):
        check(qpc1, qpc1 + qpc_delta)
```

The bound 10**3 was determined empirically. In most cases it should be enough, although it corresponds to only about 1 ms, so a particularly bad context switch between the two calls would push the delta out of range and make us miss the key. Still, let’s try:

```
>>> log(10**7 * 10**3 * 10**3 * 10**5 * 256, 2)
67.79470570797253
```

### Who needs exact time anyway? (61 bits of entropy)

2**67 SHA-256 calls is a lot, but the number is starting to feel manageable: coincidentally, it almost exactly matches the current BTC hashrate. If, instead of burning electricity pointlessly, the entire BTC network were repurposed to decrypt Phobos victims, it could decrypt one victim per second, assuming the PID and TID were known.

It turns out we over-counted the SystemTimeToFileTime component. A FILETIME nominally has 100 ns resolution, but GetLocalTime only reports whole milliseconds, so within a known second the converted value can only be a whole number of milliseconds.

This means that only 10**3 options need to be iterated over, not 10**5:

```
>>> log(10**7 * 10**3 * 10**3 * 10**3 * 256, 2)
61.150849518197795
```
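Concretely, the candidate FILETIME values within a known second can be enumerated as follows (a toy sketch; the tick constant follows from FILETIME's 100 ns resolution):

```python
TICKS_PER_MS = 10_000  # FILETIME counts 100 ns ticks; 1 ms = 10,000 ticks

def candidate_filetimes(second_start: int):
    # Toy sketch: GetLocalTime reports whole milliseconds, so after
    # SystemTimeToFileTime the value is `second_start` plus a whole number
    # of milliseconds -- only 1000 candidates per known second.
    return [second_start + ms * TICKS_PER_MS for ms in range(1000)]

cands = candidate_filetimes(0)
```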

### Math time (51 bits of entropy)

Perhaps a better algorithm exists? There is nothing obvious left to optimize.

Note that key[0] equals GetTickCount() ^ QueryPerformanceCounter().Low. A naive traversal algorithm tests all possible values of both components, but in most cases we can do much better. For example, `4 ^ 0 == 5 ^ 1 == 6 ^ 2 == ... == 4`. We only care about the final result, so timer combinations that produce the same key need to be tested only once.

A simple way to do this is as follows:

```
def ranges(fst, snd):
    s0, s1 = fst
    e0, e1 = snd
    out = set()
    for i in range(s0, s1 + 1):
        for j in range(e0, e1 + 1):
            out.add(i ^ j)
    return out
```
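On a toy range the duplication is easy to see: all sixteen timer pairs below collapse into just four distinct key values.

```python
# 4 x 4 = 16 timer combinations of i in [4, 7] and j in [0, 3] ...
pairs = [(i, j) for i in range(4, 8) for j in range(0, 4)]
# ... but only 4 distinct XOR results survive.
distinct = {i ^ j for i, j in pairs}
```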

Unfortunately, this costs quite a lot of CPU time (and we want to squeeze out as much performance as possible). It turns out there is a better, recursive algorithm. It avoids wasting time on duplicates, but it is quite subtle and not very elegant:

```
#include <algorithm>  // std::min, std::max
#include <cstdint>
#include <vector>

// Returns a mask of all bits strictly below the highest set bit of x.
uint64_t fillr(uint64_t x) {
    uint64_t r = x;
    while (x) {
        r = x - 1;
        x &= r;
    }
    return r;
}

uint64_t sigma(uint64_t a, uint64_t b) {
    return a | b | fillr(a & b);
}

// Enumerates the values of x0 ^ x1 for x0 in [s0, e0] and x1 in [s1, e1],
// avoiding duplicate work; bits above `bit` are already fixed in `prefix`.
void merge_xors(
    uint64_t s0, uint64_t e0, uint64_t s1, uint64_t e1,
    int64_t bit, uint64_t prefix, std::vector<uint32_t> *out
) {
    if (bit < 0) {
        out->push_back(prefix);
        return;
    }
    uint64_t mask = 1ULL << bit;
    uint64_t o = mask - 1ULL;
    bool t0 = (s0 & mask) != (e0 & mask);
    bool t1 = (s1 & mask) != (e1 & mask);
    bool b0 = (s0 & mask) ? 1 : 0;
    bool b1 = (s1 & mask) ? 1 : 0;
    s0 &= o;
    e0 &= o;
    s1 &= o;
    e1 &= o;
    if (t0) {
        if (t1) {
            uint64_t mx_ac = sigma(s0 ^ o, s1 ^ o);
            uint64_t mx_bd = sigma(e0, e1);
            uint64_t mx_da = sigma(e1, s0 ^ o);
            uint64_t mx_bc = sigma(e0, s1 ^ o);
            for (uint64_t i = 0; i < std::max(mx_ac, mx_bd) + 1; i++) {
                out->push_back((prefix << (bit + 1)) + i);
            }
            for (uint64_t i = (1ULL << bit) + std::min(mx_da ^ o, mx_bc ^ o); i < (2ULL << bit); i++) {
                out->push_back((prefix << (bit + 1)) + i);
            }
        } else {
            merge_xors(s0, mask - 1, s1, e1, bit - 1, (prefix << 1) ^ b1, out);
            merge_xors(0, e0, s1, e1, bit - 1, (prefix << 1) ^ b1 ^ 1, out);
        }
    } else {
        if (t1) {
            merge_xors(s0, e0, s1, mask - 1, bit - 1, (prefix << 1) ^ b0, out);
            merge_xors(s0, e0, 0, e1, bit - 1, (prefix << 1) ^ b0 ^ 1, out);
        } else {
            merge_xors(s0, e0, s1, e1, bit - 1, (prefix << 1) ^ b0 ^ b1, out);
        }
    }
}
```

Perhaps a simpler or faster algorithm exists, but we were not aware of one while working on the decryptor.

Here is the entropy after this change:

```
>>> log(10**7 * 10**3 * 10**3 * 256, 2)
51.18506523353571
```

This was the last complexity reduction. All that remains is a fast implementation.

## Speed

### Naive Python implementation (500 keys per second)

The original Python PoC tested 500 keys per second. At that rate, it would take 2,314 processor-days to sort through 10**11 keys. This is far from practical. But Python is just about the slowest option for high-performance computing; much more can be achieved.
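The estimate is simple arithmetic, using the figures above:

```python
keys = 10**11        # effective keyspace after all the reductions above
rate = 500           # keys per second in the naive Python PoC
days = keys / rate / 86_400
# over 2,300 processor-days at the naive rate
```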

Normally, in such situations, we would implement our own decryptor (in C++ or Rust). But here even this is not enough.

### Time works wonders

We decided to achieve maximum performance and implement our solver in CUDA.

### CUDA First Steps (19,166 keys per second)

Our first naive version tested 19,166 keys per second. We had no experience with CUDA and made a lot of mistakes; the GPU utilization statistics made that painfully clear.

### Improved sha256 (50,000 keys per second)

Apparently sha256 was a huge bottleneck: sha256 calls outnumber AES calls 256 to 1. Most of the work here focused on simplifying and specializing the code.

For example, we:

- inlined sha256_update and sha256_init;
- replaced global arrays with local variables;
- hardcoded the input size to 32 bytes;
- made several operations more GPU-friendly, for example implementing bswap via __byte_perm.

As a result, the main loop changed substantially. After this optimization, we also realized the code was doing a lot of unnecessary copying and moving of data; those copies turned out not to be required and were removed.

All this increased throughput 2.5 times, to 50,000 keys per second.

### Parallel algorithm (105,000 keys per second)

CUDA work can be divided into streams, and operations in different streams execute concurrently. In particular, memory copies to and from the graphics card can overlap with kernel execution without losing performance.

By simply restructuring our code to use streams more efficiently, we doubled the throughput, to 105,000 keys per second:

### And finally AES (818,000 keys per second)

Until this point, we hadn’t even tried to optimize AES. After all the lessons learned above, doing so was pretty simple: we looked for patterns that perform poorly on a GPU and rewrote them. For example, we changed a naive byte-wise for loop into a manually unrolled version operating on 32-bit integers. It may seem like a small thing, but throughput grew considerably.

### …and now in parallel (9.8 million keys per second)

We had one last trick in reserve: running the same code on multiple GPUs. Conveniently, we have a small GPU cluster at CERT.PL, and we dedicated two machines with a total of 12 Nvidia GPUs to the task. Throughput immediately rose to almost 10 million keys per second.

Thus, iterating over 10**11 keys on the cluster takes only 10,187 seconds (2.82 hours). Sounds appealing, right?
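A quick sanity check on those numbers, using the per-GPU rate from the previous section:

```python
keys = 10**11
per_gpu = 818_000            # keys per second per GPU, from the AES section
gpus = 12
seconds = keys / (per_gpu * gpus)
hours = seconds / 3600
# ≈ 10,187 seconds, i.e. about 2.8 hours for the full keyspace
```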

## Where did it all go wrong?

Unfortunately, as we mentioned at the beginning, there are many practical problems, which made it difficult even to publish the decryptor:

- Knowledge of TID and PID parameters is required. This makes it difficult to automate the decryptor.
- We assume very precise time measurements. Unfortunately, this measurement is complicated by clock drift and noise intentionally introduced into Windows performance counters.
- Not every version of Phobos is vulnerable. Before deploying an expensive decryptor, reverse engineering is required to confirm the family.
- Even after all the improvements, the code is too slow to run on a consumer computer.
- Victims do not want to wait for researchers without a guarantee of success.

That’s why we decided to publish this article and the source code of the (almost working) decryptor. The hope is that it will give malware researchers a new perspective on the subject, and perhaps even help decrypt some victims’ files.

We have published the CUDA code. It includes brief instructions, an example configuration, and a data set for testing the application.

Analyzed ransomware sample: 2704e269fb5cf9a02070a0ea07d82dc9d87f2cb95e60cb71d6c6d38b01869f66 | MWDB | VT
