“Algorithms” vs “Data Structures”

And hello again!

The other day I was lucky enough to write about algorithms, and I wrote all kinds of things about them: how I have twisted them into place where needed, and how I cannot imagine my life as a programmer without them in critical cases, important moments, difficult choices. Well, once I have taken on this work, nothing motivates me to continue it like positive reviews.

Therefore, I decided to continue the topic. Although, to tell the truth (and to pass the polygraph), I would not have been able to resist writing something more about it anyway.

Today I decided to write about the algorithms inside data structures, about those that I can remember. And, as our dear boss says: “Let’s go!”

We see the words “Algorithms” and “Data Structures” together so often that we have long been accustomed to their “inseparability” (yes, I am inventing words), and it seems to us that one cannot exist without the other. But it seems to me that I should express my own opinion on this matter and put it into theses, because my opinion may well “disagree with the party line”.

So, where does this identification of one with the other come from? It is just like the party and Lenin: when people talk about algorithms, they immediately mean data structures, and when they talk about data structures, algorithms immediately appear.

And it seems to me that these two concepts are entangled with each other, like quantum states (or what is it that gets entangled forever?). But I propose to untangle them and remember both with a kind word (“both, and you can skip the bread”).

No, when we are asked about data structures separately, I think we can somehow imagine them apart from algorithms. Here we have an array, a structure, a list, a dictionary, a table, a hash table, a tree, and a few other things, and we can tell them all apart from algorithms. But in general, algorithms without data structures rarely give anything on their own, while the data structures themselves often have those very algorithms under the hood, the ones I read about in the book “Pascal in Illustrations” mentioned in the previous post. So let’s separate algorithms in general from the algorithms inside data structures, and run through those structures from the perspective of those algorithms.

Data structures

Programmers have invented so many different structures in their short era that I certainly will not describe them all here, but I will try to go through the most useful ones. The previous article said that algorithms by themselves are not particularly needed by anyone: it is good if someone has learned to think in them, but even without that, naively iterating over the elements solves most problems in some O(N). What structures the program has to iterate over, though, the programmer should know, well, at least a little. I hope nobody will object to such a thesis. And if anyone does object, let them argue with me right here on Habr.

Well, let’s start with the efficiency of algorithms in general, and then give this efficiency for each of the structures under consideration, so that we speak the same language.

I already mentioned some O(N) above. The “O” here stands for the order of efficiency, and N is the number of elements: roughly speaking, the number of iterations needed to compute the result. Efficiency also applies to the amount of memory, so an algorithm can be very good in terms of iterations but very bad in terms of memory; we will consider both.
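For illustration, here is a tiny Python sketch of my own (the original post carries no code): a linear search may have to visit every one of the N elements before it answers, and that is exactly the growth O(N) describes.

```python
# A toy illustration: linear search visits up to N elements,
# so its work grows linearly with the size of the input.
def linear_search(items, target):
    steps = 0
    for value in items:      # worst case: every element is visited
        steps += 1
        if value == target:
            return steps
    return steps             # target absent: exactly N iterations

print(linear_search(list(range(1000)), -1))  # prints 1000, i.e. O(N)
```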

ARRAYS

So, let’s start with arrays. Let’s take a sequential list of elements as an array. In general, an array is a contiguous area of memory containing either uniform values or their addresses. An array element is accessed by index, and the efficiency of this access is O(1): when accessing an element, the program simply computes its position relative to the starting address by multiplying the size of the element (or address) by the index. If free memory is available, appending an element to the end of the array is just as cheap, also a single O(1) operation. But inserting an element in the middle forces the rest of the elements to shift, so the cost of such an insertion ranges from a conventional “O(2)” for the penultimate position up to O(N) if we suddenly want to insert at the beginning. The time to search for a value among the values of the array already tends to O(N), especially if there is no such value there.

It is easy to swap elements in an array, so it is easy to sort it with a huge number of different algorithms, from heapsort (literally “pyramid sort”, like building a child’s pyramid) to all those merge sorts implemented in various libraries for various languages. In 1C, for example, there is no method for sorting an array, so brave 1C-niks load the array into a value list that can be ordered, and then unload it back. Nobody even entertains the thought of writing their own sort. And that’s right, isn’t it?
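To make these costs concrete, here is a small Python sketch of mine (Python’s list is a dynamic array under the hood, so it fits the description above):

```python
arr = [10, 20, 30, 40]

x = arr[2]            # access by index: O(1), base address + index * size
arr.append(50)        # append at the end while memory allows: O(1)
arr.insert(0, 5)      # insert at the beginning: O(N), everything shifts
found = 99 in arr     # search for a value: O(N) in the worst case

arr[0], arr[1] = arr[1], arr[0]  # swapping elements is trivial,
arr.sort()                       # which is what sorting algorithms exploit
```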

In general, an array is as simple as two times two. Therefore, I use it very rarely. How about you? Tell me in the comments.

STRUCTURES

Next after the array, in order of increasing complexity, comes the structure. Structures are available in many languages. Essentially, a structure is a set of named elements. A structure can be imagined as a row of a database table, each column of which has its own type (strings, numbers, dates, and others). A structure is generally a “closed” thing, but in the same 1C you can insert keys with values, and even without them; there the structure turns into a “Key”: “Value” storage. And if in low-level languages access to the fields of a structure compiles down to O(1) efficiency, then in high-level languages that allow changing the set of fields, everything is not so clear-cut. There the structure is most likely implemented on top of a linked list, which we will talk about next.
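A quick Python sketch of the two flavors described above; the names here are mine, invented purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class Row:                # a "closed" structure: the set of fields is fixed
    name: str
    amount: float
    date: str

row = Row("invoice", 99.5, "2024-01-01")
print(row.amount)         # field access; in low-level languages this is O(1)

open_struct = {}                  # an "open" structure, as in 1C: keys can
open_struct["Key"] = "Value"      # be inserted on the fly at runtime
```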

LISTS

A list is, in my opinion, a linked data set where each element points to where the next one is. And if in an array you can refer to an element by index, in a list you cannot: you can only move to the next element, which contains the address of the one after it, and so on. Often, list items hold links to the next item as well as to the previous one. Among other things, a list can always be kept in an ordered state, because insertion between any two of its elements is very efficient: a new list element is created, the addresses of the next and previous elements are written into it, and those two neighbors get their next or previous address changed to point at the newcomer. Insertion into a regular list costs O(1) in principle, because the new element is simply attached after the last one (or the first, it makes no difference). Insertion into an ordered list is often quoted as O(log₂N), thanks to the same binary search for the insertion point that I talked about in the last article; in fairness, a plain linked list has no access by index, so binary search really needs a structure with cheap jumps, like the tree hiding inside std::map below.
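Here is a minimal Python sketch of that O(1) splice (the Node class and insert_after are my own illustration, not code from any library):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None          # link to the previous element
        self.next = None          # link to the next element

def insert_after(node, value):
    # Create the new element, point it at both neighbors,
    # then rewire the neighbors; no shifting, unlike an array.
    new = Node(value)
    new.prev, new.next = node, node.next
    if node.next is not None:
        node.next.prev = new
    node.next = new
    return new

head = Node(1)
insert_after(head, 2)    # list is now 1 <-> 2
insert_after(head, 5)    # list is now 1 <-> 5 <-> 2
```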

By the way, std::map is exactly such an ordered collection, although under the hood it is usually a balanced red-black tree rather than a plain linked list; that is how it achieves O(log₂N) search and insertion. Try to estimate for yourself the efficiency of walking all N elements of such a data structure, or of searching for a value in it.

HASH TABLES

Hash tables are also a very interesting data structure, and a story about them may help you look at this thing in a new way. In C++ there is std::unordered_map for such a data structure; in 1C such a structure is definitely the Map (literally, a “correspondence”); Python has the dictionary. I was not particularly interested in other languages, but in the years of my youth there was nothing like that in Pascal, in BASIC, and even less so in assembler.

So, let’s try to figure out how such associative arrays work and what the advantage of using them is.

Such a table is made up of records that are stored in memory in the same way as arrays: sequentially. But the array index is obtained by computing a hash function from the key. And the more bits the hash function returns, the more memory the hash table takes. Feeling immersed in the details yet? If we wanted to use something like MD5 as the hash function, we would need to reserve 2^128 addresses somewhere even for a single element, although access to that element would have a conventional O(1) efficiency. Naturally, programming languages cut the hash function down to an acceptable size, creating 2^8/12/16/24 slots and raising the bit width as the number of collisions grows. Speaking of collisions, let’s talk about them too.
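A small Python sketch of that cutting-down, using the real hashlib MD5 and a table width I picked arbitrarily for illustration:

```python
import hashlib

def index_for(key: bytes, bits: int = 16) -> int:
    digest = hashlib.md5(key).digest()      # 16 bytes = 128 bits
    full = int.from_bytes(digest, "big")    # the full 128-bit value
    return full % (2 ** bits)               # cut down to 2**bits slots

print(index_for(b"some key"))               # always in 0 .. 65535
```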

COLLISIONS IN HASH TABLES

A collision is when the hash function returns the same index for two different values. Do you know how to estimate the maximum number of collisions? It is simple. If we have a hash function whose result occupies N bits, the number of collisions for keys that are M bits long is 2^(M-N). To liven up the example, let’s take MD5, which returns a 128-bit result; 128 bits is 16 bytes. Say we have many files that are 20 bytes (160 bits) long. If we took every possible variant of those 20 bytes in the world, then each hash value would account for 2^(160-128) = 2^32 keys, that is, about 4 billion collisions. Of course, you do not have every possible 20-byte file, and real files come in all sorts of lengths. But keep in mind that, in principle, there may be files on your computer with the same MD5.
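The arithmetic is easy to check in Python:

```python
M = 20 * 8      # key length in bits: 20-byte files
N = 128         # MD5 result in bits
print(2 ** (M - N))   # 4294967296, i.e. about 4 billion keys per hash value
```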

So, when our hash function produces a collision, that is, when something already sits in the cell with that hash, the data at that address can be turned into an array, a list, or anything else. On the next collision, another entry is simply added to that array or list. And it is clear that the keys themselves must also be stored somewhere, so that there is something to compare against when looking for the right key.

Well, once the hash function is recognized as degenerate (once some collision threshold is exceeded), the hash table must be rebuilt with a new hash function of higher bit width. Therefore, insertion into such a table does not always cost O(1), and with a hash function tending toward degeneracy, the efficiency of reading from this data structure also drifts away from O(1). But if the hash function is chosen well, ensuring the minimum number of collisions for the processed data set, the performance of this structure is excellent.
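To tie the last two paragraphs together, here is a toy chained hash table of my own: the chaining, the stored keys, and the rebuild threshold follow the description above, but every name and constant is mine, chosen purely for illustration.

```python
class ToyHashTable:
    def __init__(self, bits=4):
        self.bits = bits
        self.slots = [[] for _ in range(2 ** bits)]
        self.count = 0

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                    # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))         # collision: chain one more entry
        self.count += 1
        if self.count > 2 * len(self.slots):   # "degeneracy" threshold hit:
            self._rebuild()                    # rebuild with a wider table

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:          # keys are stored alongside values, so the
                return v          # right entry can be found within the chain
        raise KeyError(key)

    def _rebuild(self):
        pairs = [p for bucket in self.slots for p in bucket]
        self.bits += 4            # a "hash function" with more bits
        self.slots = [[] for _ in range(2 ** self.bits)]
        self.count = 0
        for k, v in pairs:
            self.put(k, v)

table = ToyHashTable()
table.put("answer", 42)
print(table.get("answer"))   # 42
```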

CONCLUSION

Yes, beyond a few small sketches, I once again did not bring any serious code in assembler, C, Python, or even 1C. But what can you do: I am a lazy creature, like all living things. Maybe inspiration will accumulate, and I will give birth to a more technical article. But I hope that this verbal and numerical description of data structures will shed some light on how to use them, with a little understanding of the algorithms that run inside them. I hope that I wrote all this not in vain.

Subscribe to my channel, like (c)
