10 billion integers are included in the array
How experimenting with 64-bit Pharo Smalltalk surprised me.
Contents
A 32-bit party that doesn’t want to end
I haven’t programmed in Smalltalk professionally since Y2K, and at that time I had only come across 32-bit versions of Smalltalk. From time to time I try different things in the Pharo dialect of Smalltalk, which has had a 64-bit version for several years. I’ve been waiting a long time for an open source version of Smalltalk with 64-bit address space support. I’ve been using 64-bit versions of Java for the past 18 years.
I’ve been thinking for a long time when Java will make built-in support for large arrays, namely arrays with a length greater than 2?-1. The Java Collections Framework and all frameworks that implement built-in collections interfaces such as Collection
, List
, Set
, Map
do not support the size anymore int
. The maximum number of values that can be stored in a Java array is 2?-1. For all intents and purposes, this creates a limit that allows us to store just over 2 billion objects or built-in types in a Java array. There are third-party libraries that provide BigArrays
such as fastutiland also we can simulate large arrays by ourselves due to Java multidimensional arrays, but it becomes difficult to manage. It is better to use a time-tested third-party library than to encounter bugs that are difficult to test and diagnose.
I was wondering if the 64-bit Pharo Smalltalk can store more than 2?1-1 elements in an array. I know one way to test this. I will need a lot of RAM for this. Fortunately, I bought a new Mac Book Pro M2 Max last year with 96GB of RAM to be able to run experiments and benchmarks with heavy memory usage.
The new MacBook Pro M3 Max supports up to 128GB of memory. That’s a big jump from last year, when I bought a MacBook Pro M2 Max with 96GB of RAM. 128GB is 2x more than my ten-year-old Mac Pro bucket on the desktop and 8x more than my ten-year-old MacBook Pro. Mac Pro since 2019 supports up to 1.5 TB of RAM. The current configuration of the Apple Silicon Mac Pro (2023) offers a configuration of up to 192GB, which is 3 times more than my 10-year-old Mac Pro. My guess is that in 10 years we will see 256+ GB of memory in high-end consumer laptops and more terabytes in desktops.
Note: Server machines may have terabytes of memory already.
Who needs over 2 billion elements in an array?
I never need to store more than a hundred million items in List
over the past 20 years. That’s still quite a lot of stuff and it was necessary for my tasks in 2006. I’m sure there are people who solve tasks that require storing huge amounts of data in memory.
There are approximately 8.2 billion people on earth right now. It would take 4 Java arrays to store a reference to each person in memory. Storage of 8.2 billion simple objects Person
in memory would be very costly. Under a simple object Person
i mean the class Person
with surname, first name and patronymic in the form String
. The size of the array itself would be more than 65 GB (~8.2 billion x 8 bytes per transfer). Costs for copies Person
would require significantly more memory than I have available on my 96GB MacBook Pro. Let’s assume about 8.2 billion by 32 bytes for instances Person
Which would be ~262GB. In total, we would need 327GB of memory to fit all the people, including their last names, first names, and patronymics in memory. We could create a pool of names in the form String
which would have roughly a few hundred million occurrences, so we would need about 32 GB or more to store the instances String
. It’s not quite available on regular user hardware at the moment. This would be possible on high-end servers with terabytes of memory.
This got me thinking. What if we started with an object less than Person and instead used e.g. SmallInteger
in Pharo Smalltalk. I started experimenting with creating arrays larger than 2?1-1 in Pharo. As I experiment, I learn that Pharo Smalltalk has significant optimizations for SmallInteger
. Instead of storing references to objects SmallInteger
The values themselves are inline SmallInteger
. It felt like we were promised Project Valhalla value types from the Java world. I figured this out with a little digging and experimenting with a simple method sizeInMemory
. I didn’t immediately understand what was happening when the size of the instances was reported SmallInteger
was equal to zero. It made me realize that SmallInteger
was processed in the language in some special way. I was also surprised that SmallInteger
exceeded the maximum value int
in Java.
Transcript show: SmallInteger maxVal.
// Prints - 1152921504606846975
// SmallInteger -> 1,152,921,504,606,846,975
// Max Java long -> 9,223,372,036,854,775,807
This value is more like long
in Java. Value Long.MAX_VALUE
same thing in Java 9,223,372,036,854,775,807
.
In this article tells about the maximum value SmallInteger
in Smalltalk, and how values themselves are stored instead of references. SmallInteger
uses 61
bit instead 64
.
The biggest difference between SmallInteger
in Smalltalk and long
Java is what happens when you add one to the maximum value.
In Smalltalk we get LargePositiveInteger
. Dynamic typing will help in a hurry.
Transcript cr; show: SmallInteger maxVal.
Transcript cr; show: SmallInteger maxVal class.
Transcript cr; show: SmallInteger maxVal + 1.
Transcript cr; show: (SmallInteger maxVal + 1) class.
// Prints
// 1152921504606846975
// SmallInteger
// 1152921504606846976
// LargePositiveInteger
When we add 1
to the maximum value SmallInteger
Smalltalk creates instances dynamically LargePositiveInteger
. This is the advantage of a dynamic language where everything is an object.
In Java, we get a silent but potentially fatal overflow.
System.out.println(BigInteger.valueOf(Long.MAX_VALUE));
System.out.println(BigInteger.valueOf(Long.MAX_VALUE + 1));
// Prints
// 9223372036854775807
// -9223372036854775808
Adding 1 to the maximum value of long causes an overflow and the result becomes negative. We cannot dynamically change the type of long
in any other in Java. What happened long
remains long
what was short
– also remains short
. This is one of those times where static typing draws the short straw. We learned how to find workarounds for this in Java.
Let’s end this and move on to our experiments.
Experiments
I couldn’t stop at the implementation in Smalltalk without trying to implement it in Java. Pharo Smalltalk gave me all the tools I needed. I used a combination of libraries fastutil and Eclipse Collections to replicate the Java experiment. One of the positives about Java is that it has a rich ecosystem whose members have created many solutions for most tasks you might encounter.
Pharo version of Smalltalk
I started with 5 billion instances SmallInteger
in Array
in Pharo. After it worked, I increased the total to 8.2 billion. It still worked. Then tried 10 billion. It still worked. I was very surprised. I didn’t think I had enough memory to store 10 billion instances. That’s because I didn’t understand at the time how Smalltalk handled “instances” SmallInteger
.
Below is the source code for the 10 billion experiment. You will need 96 GB of memory, of which about 85 must be free to run this code. You can reduce the value to 5 billion if you have 64 GB of memory.
|array sum|
array := (1 to: 10000000000) collect: [ :each | each * 2 ].
sum := array sum.
Transcript cr; show: array size class.
Transcript cr; show: array size.
Transcript cr; show: sum.
Transcript cr; show: sum class.
(1 to: 10000000000 by: 1000000000) do: [ :index | Transcript cr;
show: 'index: ';
show: index;
show: ' value: ';
show: (array at: index);
show: ' ';
show: (array at: index) class ].
Transcript cr; show: 'Array memory (bytes) ';
show: array sizeInMemory.
Transcript cr; show: 'Elements memory (bytes) ';
show: (array sumNumbers: #sizeInMemory).
The output of the code below:
SmallInteger
10000000000
100000000010000000000
LargePositiveInteger
index: 1 value: 2 SmallInteger
index: 1000000001 value: 2000000002 SmallInteger
index: 2000000001 value: 4000000002 SmallInteger
index: 3000000001 value: 6000000002 SmallInteger
index: 4000000001 value: 8000000002 SmallInteger
index: 5000000001 value: 10000000002 SmallInteger
index: 6000000001 value: 12000000002 SmallInteger
index: 7000000001 value: 14000000002 SmallInteger
index: 8000000001 value: 16000000002 SmallInteger
index: 9000000001 value: 18000000002 SmallInteger
Array memory (bytes) 80000000016
Elements memory (bytes) 0
These results show that in SmallInteger
there are no additional costs for the instances themselves. Their values are inlined in Array
. So we only need memory for the Array itself, which is about 80 GB.
Java version with fastutil
Below is the source code for the 10 billion Java experiment. You will also need 96 GB of memory, of which 85 are free. I added the ‑Xmx85g argument to the command line. You can also reduce the number to 5 billion if you have 64 GB of RAM. For a large list long
used fastutil. To summarize BigInteger
– Eclipse Collections. Why was it necessary to use it? BigInteger
you will see further.
First I added the fastutil library as a Maven dependency.
it.unimi.dsi
fastutil
8.5.14
Then I wrote a test that uses LongBigArrayBigList
to store 10 billion longs. This is roughly equivalent to an array of 10 billion elements SmallInteger
in Smalltalk.
@Test
public void fastutilTenBillion()
{
LongBigArrayBigList longs = new LongBigArrayBigList(10_000_000_000L);
LongStream.rangeClosed(1, 10_000_000_000L)
.forEach(l -> longs.add(l * 2));
long sum = longs.longStream().sum();
BigInteger bigSum = longs.longStream()
.boxed()
.collect(Collectors2.summingBigInteger(BigInteger::valueOf));
System.out.println("size: " + longs.size64());
System.out.println("long sum: " + sum);
System.out.println("BigInteger sum: " + bigSum);
for (long l = 0; l
The results are as follows:
size: 10000000000
long sum: 7766279641452241920
BigInteger sum: 100000000010000000000
index: 0 value: 2
index: 1000000000 value: 2000000002
index: 2000000000 value: 4000000002
index: 3000000000 value: 6000000002
index: 4000000000 value: 8000000002
index: 5000000000 value: 10000000002
index: 6000000000 value: 12000000002
index: 7000000000 value: 14000000002
index: 8000000000 value: 16000000002
index: 9000000000 value: 18000000002
Now, the first thing you’ll probably notice if you haven’t come across this is that Java indexes start with 0, while Smalltalk starts with 1. These values are correct. We only need to remember this comparison of results.
Another point is that sums long
and BigInteger
are different But why?
The following test will show us the reasons:
@Test
public void longMaxValue()
{
BigInteger maxLong = BigInteger.valueOf(Long.MAX_VALUE);
BigInteger sum = new BigInteger("100000000010000000000");
System.out.println(maxLong);
System.out.println(sum);
}
Results:
9223372036854775807 // Max Long Value in Java
100000000010000000000 // Sum of 10 billion long values
The maximum value of long is two bits shorter than the sum. The sum of ten billion numbers multiplied by 2 is a value greater than the maximum value long
in Java. So I needed to collect BigInteger
for the sum of numbers from String
— the value itself is too large for long
. I didn’t initially assume that this test would cause a value to overflow long
it is more characteristic of values int
. Maximum value long
HUGE, but, as practice has shown, not huge enough.
When I realized that the amount was incorrect, I decided to try to use it Collectors2.summingBigInteger()
Collector
from Eclipse Collections. It did the job well with one drawback – I need to pack the values long
with LongStream
in Long
and then re-in BigInteger
. Maybe I would figure out a way to just use the method collect
on LongStream
if I continued to deal with the code a little longer, but it would require mutating result, so I decided not to bother.
Reflections on limitations
Most developers currently do not need arrays longer than 2?1-1 elements. This will most likely be true next year as well. But over the course of 5, 10, 20 years, this number will gradually touch more and more developers as work with large volumes of data becomes more widespread. Collections of size 2?1-1 elements can become more difficult to use. We already have ready-made solutions for Java if we need a larger amount int
or long
. Sometimes it can be difficult to write code that works quickly and correctly when you are pushing against the limits of primitives or arrays.
In the late 1980s and early 1990s, nobody needed more than 640K of RAM. Now we can buy 128GB on custom laptops. Our great-grandchildren may laugh when they hear how we suffered using 64-bit computing. The trends in the gland tell us that progress does not stop.
When we encounter limitations in hardware or software, we have to find solutions. I’ve worked in a memory-constrained environment since I started professional development in the late 1980s. Hardware and software have always found a way to keep up with the times. And when that doesn’t happen, we have to get creative.
Java has many important problems that need to be solved and I’m not sure when the maximum array size problem will become a priority. I believe that in the next decade. And until that happens, libraries like fastutil can help solve the problem now.
The thing that continues to fascinate me about Smalltalk is its adaptability. Add a little more – and it would seem that you should fly off the cliff, but no – Smalltalk will pleasantly surprise you and return another type capable of processing your request. And I often miss this feature of his.