Sparse data structures

Thinned skies over Diane and BoJack the Horse

I once wrote a post about various interesting data structures. Among them was the so-called sparse set, which I described only in general terms, omitting some details (they were later added to that article). But besides the sparse set there are other sparse data structures! We will look at them today 🙂

Sparse Array (Sparse Matrix)

When we talk about sparse arrays and matrices, we most often mean not storing elements with a particular value (e.g. 0 or None/null). For example, let’s look at a vector (in the mathematical sense):

Or at a matrix:

In both the vector A and the matrix B, most of the elements are zero. In such cases we can save memory by storing only the non-zero values. The clearest and simplest solution is to turn the structure into an associative array:
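For instance, a dense vector can be collapsed into an index → value map. A minimal C++ sketch (the function name is mine, not from any library):

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Keep only the non-zero entries of a dense vector as index -> value.
std::unordered_map<std::size_t, double> ToSparse(const std::vector<double>& dense) {
    std::unordered_map<std::size_t, double> sparse;
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0) {
            sparse[i] = dense[i];
        }
    }
    return sparse;
}
```

For a vector with a handful of non-zeros, the map holds just those few entries instead of the full length.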

Or split it into several arrays: in the case of a sparse array, one for coordinates and one for values; in the case of a sparse matrix, two for coordinates and one for values. For example, for the matrix above such a transformation might look like this:
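This "parallel arrays" layout is commonly called coordinate (COO) form. A sketch under that assumption (names are mine, not from any particular library):

```cpp
#include <cstddef>
#include <vector>

// Coordinate (COO) form of a sparse matrix: three parallel arrays holding
// the row index, column index, and value of every non-zero entry.
struct Coo {
    std::vector<std::size_t> rows;
    std::vector<std::size_t> cols;
    std::vector<double> vals;
};

Coo ToCoo(const std::vector<std::vector<double>>& m) {
    Coo coo;
    for (std::size_t r = 0; r < m.size(); ++r) {
        for (std::size_t c = 0; c < m[r].size(); ++c) {
            if (m[r][c] != 0.0) {
                coo.rows.push_back(r);
                coo.cols.push_back(c);
                coo.vals.push_back(m[r][c]);
            }
        }
    }
    return coo;
}
```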

Or turn it into a (singly or doubly) linked list whose nodes store all the coordinates and the value at once. It doesn’t really matter; the concept is the same everywhere.

Similar approaches save time and memory when computing with certain kinds of matrices: for example, block, triangular, or diagonal ones. There are ready-made Python packages that implement this logic (scipy.sparse).

There are no particularly clever ideas here, of course. But it gets more interesting!

Sparse List

A sparse list is sparse in another sense. Which one? I won’t say just yet.

Let’s look at a doubly linked list. Modifying operations (insertion, deletion) take O(1), and non-modifying ones (traversal in either direction) take O(n). For all of this we only need the node itself (no need to search for a neighboring node, as in the case of a singly linked list). On the other hand, we are forced to store two pointers per node (in modern realities, that’s often a whole 16 bytes!). Can we reduce memory consumption without greatly hurting the speed of the operations?

So! What if we make a list of variable linkedness? That is, a list whose linkedness is somewhere between singly and doubly linked. We can achieve this with a mixed list in which only some of the nodes are doubly linked.

That is, we have some subset of nodes with two pointers: one points to the previous doubly linked node, the other to the next node in the list. Because the length of each singly linked run is bounded and fixed in advance, any operation still takes O(1), just like in a doubly linked list (with a noticeably larger constant, of course). A thoughtful reader may notice that you could keep inserting into one run and break all our asymptotics! But nothing prevents us from keeping these singly linked runs no longer than a certain size (3 in the figure, counting the doubly linked nodes at the ends). That is, if you are about to insert a singly linked node after another singly linked node, make the new node doubly linked instead and fix up the links in its doubly linked neighbors.

Now, of course, implementing each operation takes a bit more effort, but you still don’t have to scan from the very beginning of the list to find the node preceding a known one. It is enough to walk forward to the first doubly linked node, from which you can jump back.
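To make that backward step concrete, here is a sketch (node layout and names are mine; in this toy version a null prev marks a singly linked node, and the list is assumed to end in a doubly linked node so the forward walk terminates):

```cpp
// A node of the sparse list. In a real implementation singly linked nodes
// would not even have the prev field; here nullptr marks them instead.
struct Node {
    int value = 0;
    Node* next = nullptr;
    Node* prev = nullptr;  // points to the previous *doubly linked* node
};

// Find the node immediately preceding `target` without scanning from the
// head: walk forward to the nearest doubly linked node, jump back through
// its prev pointer, then walk forward again. With runs of singly linked
// nodes bounded by a constant, this takes O(1).
Node* FindPrev(Node* target) {
    Node* d = target;
    while (d->prev == nullptr) {
        d = d->next;  // reach a doubly linked node at or after target
    }
    Node* p = d->prev;  // a doubly linked node strictly before target
    while (p->next != target) {
        p = p->next;
    }
    return p;
}
```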

Actually, the structure is called sparse precisely because the doubly linked nodes sit at some distance from each other.

How do you choose this “sparseness factor”? There is no single good answer. Empirically, such a structure is much easier to write (and has a smaller constant) if the two kinds of nodes simply alternate. In particular, traversing the list in reverse order is much cheaper when doubly linked nodes occur more often.

And how do we actually save memory? From a code-writing perspective you might want to use something like union/std::variant, but such types take the size of their largest alternative. That is, we would end up with a solution no smaller than a doubly linked list.

Unfortunately, it’s not that simple. Either way, you need to be able to work with nodes of different sizes (type erasure can help here to some extent). Also, it’s convenient to place the pointer to the previous node last in the structure, so that the pointer to the next node and the value sit at the same offsets in both node types. Determining the type of a node, however, is not hard. Because pointers are aligned in memory, their low-order bits are always zero. So we can take one of the low bits of the pointer to the next element and set it to 1 if the node also has a back pointer (it is important not to forget to clear that bit before dereferencing the next pointer). The problems start when you want to move a node from one list to another: if the node can change its type, you have to free and reallocate its memory. If you like, you can store no values at all in the doubly linked nodes, but the problems mostly remain, and it’s unclear how much we would save at all.
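The bit trick from the previous paragraph, sketched on its own (function names are mine; this assumes node alignment greater than one, so the low pointer bit is free):

```cpp
#include <cstdint>

// Pointers to aligned objects have zero low-order bits, so we can borrow
// the lowest bit of the `next` pointer to record "this node also has a
// prev pointer". Remember to clear the bit before dereferencing!
inline void* MarkDoublyLinked(void* next) {
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(next) | 1u);
}

inline bool IsDoublyLinked(void* next) {
    return (reinterpret_cast<std::uintptr_t>(next) & 1u) != 0;
}

inline void* RealPointer(void* next) {
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(next) &
                                   ~std::uintptr_t{1});
}
```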

Here’s a link to the implementation if you’re really interested in messing around.

Sparse Deque (sparque)

A sparse deque was recently posted by a reddit user. Let’s run through some basics to understand how it works.

First, we need to understand what a B-tree is (read about it here). A B-tree, roughly speaking, is an m-ary search tree: each node can store several values and have several children. This lets you perform the operations standard for binary search trees, but with a logarithm of a larger base. Data locality also helps traverse the elements faster. Of course, although the tree becomes shallower, we shouldn’t forget about the constant that comes from the larger number of comparisons per node. It looks something like this:

Next, there are B+ trees. These are B-trees that simulate an associative array: values are stored only in the leaves, while the inner nodes store only copies of the keys:

A leaf of a B+ tree usually also stores a pointer to the next leaf, which lets you traverse all the elements of the tree in linear time without going up and down the tree.

And the last piece is the counted B-tree: a B-tree in which, for each child, we store the size of its subtree:
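The subtree sizes are exactly what makes search-by-index possible: at every node you subtract child counts until you find the child that covers the index. A sketch of one descent step (names are mine):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Given the subtree sizes of a node's children, find which child contains
// the element with 0-based index k, and the index of that element within
// the child's subtree. Repeating this from the root down to a leaf gives
// search by index in O(log_m(n)).
std::pair<std::size_t, std::size_t> DescendStep(
    const std::vector<std::size_t>& counts, std::size_t k) {
    std::size_t child = 0;
    while (k >= counts[child]) {
        k -= counts[child];
        ++child;
    }
    return {child, k};
}
```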

So, sparque is a counted B+ tree, with the caveat that while a canonical B+ tree has both keys and values, here we store only the values (in the leaves). The remaining nodes hold only the sizes of the child subtrees and the pointers to those subtrees. At the bottom level (in the leaves), the data structure therefore looks like a list of blocks, each storing several values. Why not a deque! Except that each leaf block is something like an array you can efficiently insert into from both ends, so some operations become faster.

The operations work out like this:

  • search by index – O(log_m(n)): using the subtree sizes, we descend from the root of the tree to a leaf. This is slower than std::deque with its O(1), but it is to some extent a tradeoff for speeding up other operations;

  • the author claims that insertion/deletion at the beginning/end is O(1). That sounds doubtful, because besides the insertion itself you also need to update the subtree sizes all the way from the leaf to the root, which is what push_back does in the update_counts_plus function. But maybe I’m missing something;

  • insertion/deletion at an arbitrary position – O(m), where m is the number of values in a leaf block;

  • iteration – O(n).

Memory: O(n).

Why the author decided to call this structure sparse is not known for certain.

Here are benchmarks from the author, compared to other containers. And here is the implementation.

Sparse Set

Let’s recall what a sparse set is.

A sparse set allows you to perform the same operations as any other set you are familiar with, but in terms of implementation it is closer to std::vector<bool>/std::bitset. To start, we need two arrays and a size:

template <int R>  // stored values must lie in [0, R)
struct SparseSet {
	int size_ = 0;
	int dense_[R];
	int sparse_[R];
};

dense_ is the array of values. sparse_ stores, for each value, its index in dense_: if dense_[i] = v, then sparse_[v] = i (a pointer can be stored instead of an index). Addition then looks like this:

void Add(int val) {
	dense_[size_] = val;
	sparse_[val] = size_;
	++size_;
}

For a better understanding, let’s look at an example. Suppose we inserted the following values into our set: 3, 2, 5, 1, 0.

size_ is now 5; the empty cells contain nothing in particular (garbage).

You can now check your set for membership:

bool Contains(int val) {
	return sparse_[val] < size_ && dense_[sparse_[val]] == val;
}

If we want to check whether 2 is present, the expression looks like this:

sparse_[2] < 5 && dense_[sparse_[2]] == 2

Which is equivalent to:

1 < 5 && 2 == 2

Two is in the set!

For values that are not in our set, we will be reading garbage cells (if nothing was ever written there), and the condition is unlikely to pass (it’s still garbage).

If we want to remove a certain element, a little more work is needed. Say we want to remove 0. That’s easy: just decrease the size, and the zero will be overwritten by the next insertion. But what if we want to remove an element that is not at the end of dense_? Then we swap the corresponding element of dense_ with the last one and maintain the invariant in sparse_. For example, removing the two:

Now you can safely reduce size_. In code it will look like this:

void Remove(int val) {
	if (dense_[size_ - 1] != val) {
		int last_element = dense_[size_ - 1];
		dense_[sparse_[val]] = last_element;
		sparse_[last_element] = sparse_[val];
	}
	--size_;
}

And of course, if we want to clear the entire data structure, we just need to do one single action:

void Clear() { size_ = 0; }

All operations (from creation to deletion) run in O(1)!
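As a sanity check, here is the whole structure assembled (same members and methods as above; in this sketch the arrays are zero-initialized, purely so that lookups for never-inserted values stay within bounds):

```cpp
template <int R>  // stored values must lie in [0, R)
struct SparseSetDemo {
    int size_ = 0;
    int dense_[R] = {};   // the values, packed at the front
    int sparse_[R] = {};  // sparse_[v] = index of v in dense_

    void Add(int val) {
        dense_[size_] = val;
        sparse_[val] = size_;
        ++size_;
    }
    bool Contains(int val) const {
        return sparse_[val] < size_ && dense_[sparse_[val]] == val;
    }
    void Remove(int val) {
        if (dense_[size_ - 1] != val) {
            int last_element = dense_[size_ - 1];
            dense_[sparse_[val]] = last_element;
            sparse_[last_element] = sparse_[val];
        }
        --size_;
    }
    void Clear() { size_ = 0; }
};
```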

On top of that, you also have the elements next to each other in memory, which allows you to use data locality when iterating over the set.

Of course, a structure like this has a limit on the maximum value it can store. Also, if you store complex types, you will still need to call destructors on removal. But these are trifles.

folly has a rather primitive implementation: the size is fixed, which defeats the point of tuning it. But maybe someone will convince them that this isn’t necessary?)

Of course, you can make the arrays dynamic instead of static. And you can even extend this beyond numeric types with the help of hashing! However, not with just any hash function: since for numeric types we effectively know all possible values in advance, a similar property is needed when hashing arbitrary objects so that there are no collisions. Perfect hashing can help with this.

Sparse Map

OK, the sparse map is a bit of a cheat; there is nothing especially interesting here. Instead of a plain value, dense_ now stores a (key, value) pair, and sparse_ works with keys only.
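A minimal sketch of that idea (member and method names are mine, mirroring the set above; keys must lie in [0, R), and the arrays are zero-initialized here only to keep out-of-map lookups well-defined):

```cpp
#include <utility>

template <int R, typename V>  // keys must lie in [0, R)
struct SparseMap {
    int size_ = 0;
    std::pair<int, V> dense_[R];  // (key, value) pairs, packed at the front
    int sparse_[R] = {};          // sparse_[key] = index of the pair in dense_

    void Put(int key, V value) {
        dense_[size_] = {key, value};
        sparse_[key] = size_;
        ++size_;
    }
    bool Contains(int key) const {
        return sparse_[key] < size_ && dense_[sparse_[key]].first == key;
    }
    V& Get(int key) { return dense_[sparse_[key]].second; }
};
```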

If you are interested in looking at the benchmarks, you can look here.

If you are interested in looking at the implementation, you can look at the Golang one or dig into the C++ code of some lib.


Subscribe or unsubscribe to my programming and C++ channel: t.me/thisnotes.
