Recently, I stumbled upon a Reddit thread pointing to a repository comparing the performances of implementations of the Sieve of Eratosthenes in different languages. In short, the Sieve is an (ancient) algorithm to find all prime numbers up to a given limit.
Results were intriguing, to say the least:
Language | Performance (seconds) | |
---|---|---|
C |
0.501 |
real 0m0.501s user 0m0.490s sys 0m0.006s |
Go 1.11 |
8.915 |
real 0m8.915s user 0m31.161s sys 0m0.227s |
Python 3.6 |
10.85 |
real 0m10.850s user 0m10.739s sys 0m0.066s |
Ruby 2.4 |
481.470 |
real 8m1.470s user 7m46.913s sys 0m3.977s |
Kotlin 1.3 |
2034.561 |
real 33m54.561s user 33m18.257s sys 0m11.193s |
The biggest - and worst - surprise came from Kotlin. I couldn’t believe my eyes!
Then, I looked at the code:
fun main(args: Array<String>) {
fun make_sieve(src: Sequence<Int>, prime: Int) = src.filter { it % prime != 0 }
var sieve = sequence {
var x = 2
while (true) yield(x++)
}
for (i in 1..10000) {
val prime = sieve.first()
println(prime)
sieve = make_sieve(sieve, prime)
}
}
After some comments from Redditors, the author replaced Sequence
by Iterator
, and the updated code is able to run in 2 seconds.
It’s still a lot.
I believe it’s possible to do much better, so let’s do it.
Naive reference implementation
Before trying to do better, let’s get back to the algorithm itself, in pseudo-code:
Input: an integer n > 1. Let A be an array ofBoolean
values, indexed by integers 2 to n, initially all set totrue
. for i = 2, 3, 4, …, not exceeding √n: if A[i] istrue
: for j = i2, i2+i, i2+2i, i2+3i, …, not exceeding n: A[j] :=false
. Output: all i such that A[i] istrue
.
In order to compare, a reference implementation is required. The following snippet is the direct translation of the above naive straightforward algorithm in Kotlin:
fun sieveSimple(n: Int): Array<Boolean> {
val indices = Array(n) { true }
val limit = sqrt(n.toDouble()).toInt()
val range = 2..limit
for (i in range) {
if (indices[i]) {
var j = i.toDouble().pow(2).toInt()
while (j < n) {
indices[j] = false
j += i
}
}
}
return indices
}
Running this code on my machine takes about 9 milliseconds. This will be the baseline for improvements.
Coroutines first draft
Kotlin coroutines allow to run code concurrently.
One easy optimization is to run a computation for a specific i
on a coroutine, and aggregate the results in one dedicated array.
fun sieveCoroutines(n: Int): Array<Boolean> {
val indices = Array(n) { true }
val limit = sqrt(n.toDouble()).toInt()
val range = 2..limit
runBlocking {
async { (1)
for (i in range) {
val result = computeForSingleNumber(i, n)
mergeBooleanArrays(indices, result) (2)
}
}.await()
}
return indices
}
fun computeForSingleNumber(i: Int, n: Int): Array<Boolean> {
val indices = Array(n) { true }
if (indices[i]) {
var j = i.toDouble().pow(2).toInt()
while (j < n) {
indices[j] = false
j += i
}
}
return indices
}
fun mergeBooleanArrays(a1: Array<Boolean>, a2: Array<Boolean>) {
val range = 0 until a1.size
for (i in range) a1[i] = a1[i] && a2[i]
}
1 | Run instructions of the block in parallel |
2 | Merge the computation results of the coroutine into the indices array |
The above code is executed in about 300ms on my laptop. While much better than the code of the original poster, it takes 30 times more than the above baseline! Coroutines are not to blame.
The merging of the results after each coroutine run is the block that takes time.
Coroutines and shared mutable state
Merging different arrays after each run is not the way to go.
Obviously, computations do not depend on one another: a single computation is enough to cross out a potential prime. The worst that can happen is that two different computations cross out the same number, which doesn’t change the result.
Hence, the array can be safely shared among all coroutines. The updated code looks like the following:
fun sieveSharedMutableState(n: Int): Array<Boolean> {
val indices = Array(n) { true }
val limit = sqrt(n.toDouble()).toInt()
val range = 2..limit
runBlocking {
async { (1)
for (i in range) {
computeForSingleNumber(i, indices) (2)
}
}.await()
}
return indices
}
fun computeForSingleNumber(i: Int, indices: Array<Boolean>) {
val n = indices.size
if (indices[i]) {
var j = i.toDouble().pow(2).toInt()
while (j < n) {
indices[j] = false (3)
j += i
}
}
}
1 | Run instructions in the block in parallel |
2 | Pass the shared mutable array to the coroutine |
3 | Update the shared mutable array from the coroutine |
Running the above code on my machine takes ~4 milliseconds. A whopping improvement of 50% over my original baseline!
Conclusions
Just as with standard Java concurrent programming APIs, coroutines do not guarantee anything. They should be used with the correct data structures. Sequences are lazy, but if they are called at every iteration, they can become slower than their eager mutable counterparts.