Jan 27, 2019 / ALGORITHM, PERFORMANCE, KOTLIN

The sieve of Eratosthenes

Recently, I stumbled upon a Reddit thread pointing to a repository comparing the performances of implementations of the Sieve of Eratosthenes in different languages. In short, the Sieve is an (ancient) algorithm to find all prime numbers up to a given limit.

Results were intriguing, to say the least:

Language Performance (seconds)

Language	Performance (seconds)
C	0.501	`clang -fblocks -o prime_c prime.c time ./prime_c &>/dev/null` real 0m0.501s user 0m0.490s sys 0m0.006s
Go 1.11	8.915	`go build -o prime_go prime.go time ./prime_go &>/dev/null` real 0m8.915s user 0m31.161s sys 0m0.227s
Python 3.6	10.85	`time python3 prime.py &>/dev/null` real 0m10.850s user 0m10.739s sys 0m0.066s
Ruby 2.4	481.470	`time ruby prime.rb &>/dev/null` real 8m1.470s user 7m46.913s sys 0m3.977s
Kotlin 1.3	2034.561	`kotlinc prime.kt -include-runtime -d prime_kt.jar time java -jar prime_kt.jar &>/dev/null` real 33m54.561s user 33m18.257s sys 0m11.193s

0.501

clang -fblocks  -o prime_c prime.c
time ./prime_c &>/dev/null

real    0m0.501s
user    0m0.490s
sys     0m0.006s

Go 1.11

8.915

go build -o prime_go prime.go
time ./prime_go &>/dev/null

real    0m8.915s
user    0m31.161s
sys     0m0.227s

Python 3.6

10.85

time python3 prime.py &>/dev/null

real    0m10.850s
user    0m10.739s
sys     0m0.066s

Ruby 2.4

481.470

time ruby prime.rb &>/dev/null

real    8m1.470s
user    7m46.913s
sys     0m3.977s

Kotlin 1.3

2034.561

kotlinc prime.kt -include-runtime -d prime_kt.jar
time java -jar prime_kt.jar &>/dev/null

real    33m54.561s
user    33m18.257s
sys     0m11.193s

The biggest - and worst - surprise came from Kotlin. I couldn’t believe my eyes!

Then, I looked at the code:

fun main(args: Array<String>) {
  fun make_sieve(src: Sequence<Int>, prime: Int) = src.filter { it % prime != 0 }
  var sieve = sequence {
    var x = 2
    while (true) yield(x++)
  }
  for (i in 1..10000) {
    val prime = sieve.first()
    println(prime)
    sieve = make_sieve(sieve, prime)
  }
}

After some comments from Redditors, the author replaced Sequence by Iterator, and the updated code is able to run in 2 seconds. It’s still a lot. I believe it’s possible to do much better, so let’s do it.

Naive reference implementation

Before trying to do better, let’s get back to the algorithm itself, in pseudo-code:

Input: an integer n > 1.

Let A be an array of Boolean values, indexed by integers 2 to n,
initially all set to true.

for i = 2, 3, 4, …, not exceeding √n:

  if A[i] is true:

    for j = i2, i2+i, i2+2i, i2+3i, …, not exceeding n:

      A[j] := false.

Output: all i such that A[i] is true.

In order to compare, a reference implementation is required. The following snippet is the direct translation of the above naive straightforward algorithm in Kotlin:

fun sieveSimple(n: Int): Array<Boolean> {
  val indices = Array(n) { true }
  val limit = sqrt(n.toDouble()).toInt()
  val range = 2..limit
  for (i in range) {
    if (indices[i]) {
      var j = i.toDouble().pow(2).toInt()
      while (j < n) {
        indices[j] = false
        j += i
      }
    }
  }
  return indices
}

Running this code on my machine takes about 9 milliseconds. This will be the baseline for improvements.

Coroutines first draft

Kotlin coroutines allow to run code concurrently. One easy optimization is to run a computation for a specific i on a coroutine, and aggregate the results in one dedicated array.

fun sieveCoroutines(n: Int): Array<Boolean> {
  val indices = Array(n) { true }
  val limit = sqrt(n.toDouble()).toInt()
  val range = 2..limit
  runBlocking {
    async {                                       (1)
      for (i in range) {
        val result = computeForSingleNumber(i, n)
        mergeBooleanArrays(indices, result)       (2)
      }
    }.await()
  }
  return indices
}

fun computeForSingleNumber(i: Int, n: Int): Array<Boolean> {
  val indices = Array(n) { true }
  if (indices[i]) {
    var j = i.toDouble().pow(2).toInt()
    while (j < n) {
      indices[j] = false
      j += i
    }
  }
  return indices
}

fun mergeBooleanArrays(a1: Array<Boolean>, a2: Array<Boolean>) {
  val range = 0 until a1.size
  for (i in range) a1[i] = a1[i] && a2[i]
}

1	Run instructions of the block in parallel
2	Merge the computation results of the coroutine into the indices array

The above code is executed in about 300ms on my laptop. While much better than the code of the original poster, it takes 30 times more than the above baseline! Coroutines are not to blame.

The merging of the results after each coroutine run is the block that takes time.

Coroutines and shared mutable state

Merging different arrays after each run is not the way to go.

Obviously, computations do not depend on one another: a single computation is enough to cross out a potential prime. The worst that can happen is that two different computations cross out the same number, which doesn’t change the result.

Hence, the array can be safely shared among all coroutines. The updated code looks like the following:

fun sieveSharedMutableState(n: Int): Array<Boolean> {
  val indices = Array(n) { true }
  val limit = sqrt(n.toDouble()).toInt()
  val range = 2..limit
  runBlocking {
    async {                                (1)
      for (i in range) {
        computeForSingleNumber(i, indices) (2)
      }
    }.await()
  }
  return indices
}

fun computeForSingleNumber(i: Int, indices: Array<Boolean>) {
  val n = indices.size
  if (indices[i]) {
    var j = i.toDouble().pow(2).toInt()
    while (j < n) {
      indices[j] = false                   (3)
      j += i
    }
  }
}

1	Run instructions in the block in parallel
2	Pass the shared mutable array to the coroutine
3	Update the shared mutable array from the coroutine

Running the above code on my machine takes ~4 milliseconds. A whopping improvement of 50% over my original baseline!

Conclusions

Just as with standard Java concurrent programming APIs, coroutines do not guarantee anything. They should be used with the correct data structures. Sequences are lazy, but if they are called at every iteration, they can become slower than their eager mutable counterparts.

The complete source code for this post can be found on Github.

To go further:

Follow me Follow me