I recently answered a question on the German Quora (https://de.quora.com/Warum-sollte-man-in-Java-lieber-auf-die-Rekursion-verzichten/answer/Markus-Toman) that asked: “Why should you refrain from recursion in Java?”.

I think we should avoid recursion by default, except if:

  • you are working in a language where recursion is essential and highly optimized (Haskell, LISP dialects, Erlang/Elixir and most other functional languages)
  • the recursive version is much easier to maintain and understand AND does not degrade performance to an unreasonable degree
  • the recursive algorithm is easier to parallelize (typically because of immutability)

In this post we discuss this topic, starting with high-level Python, moving on to C++ and finally ending up in assembly.

Especially when a language does not support tail call optimization (TCO – see also Tail Recursion – GeeksforGeeks), we have to worry about memory usage and performance, and in the worst case a stack overflow. While recursion can often be beautiful and express a solution more clearly, many languages are simply not designed to handle it well. So of course, if you’re a functional programmer – have fun and recurse your brain out. If you’re dealing with a language like Python, be a bit more careful.

Especially if you’re dealing with high recursion depth or many parallel tasks performing recursion.
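To make this concrete before we get to the real example: CPython performs no TCO and caps the interpreter call stack at a configurable limit (1000 frames by default), so even a harmless-looking recursive function dies quickly. A minimal sketch (countdown is a made-up toy function):

import sys

def countdown(n):
    # every call adds another frame to the interpreter call stack
    return 0 if n == 0 else countdown(n - 1)

print(sys.getrecursionlimit())  # 1000 by default in CPython

try:
    countdown(100_000)
except RecursionError as err:
    print('recursion depth exceeded:', err)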

Let’s start at a high level

Dynamic time warping in Python

A couple of years ago I used an existing, recursive implementation of dynamic time warping (DTW) with cross-entropy as the distance measure to map sequences of probability distributions.

The relevant part is shown here:

def calculate_backward(self, i1, i2):
    '''
    Calculate the DTW distance between
    seq1[:i1 + 1] and seq2[:i2 + 1]
    '''
    if self._map.get((i1, i2)) is not None:
        return self._map[(i1, i2)]

    if i1 == -1 or i2 == -1:
        self._map[(i1, i2)] = float('inf')
        return float('inf')

    min_i1, min_i2 = min((i1 - 1, i2), (i1, i2 - 1), (i1 - 1, i2 - 1),
                         key=lambda x: self.calculate_backward(*x))

    self._map[(i1, i2)] = self.get_distance(i1, i2) + self.calculate_backward(min_i1, min_i2)

    return self._map[(i1, i2)]

(see also m-toman/AMTV @ GitHub)

While I wasn’t dealing with very long sequences, this implementation blew up pretty quickly with a stack overflow: even though the results are memoized in self._map, the first call still has to recurse all the way down to the base case, so the call depth grows with the sequence lengths.

So I implemented the iterative version as follows:

for i in range(0, len(self._seq1)):
    self._map[(i, -1)] = float('inf')

for i in range(0, len(self._seq2)):
    self._map[(-1, i)] = float('inf')

self._map[(-1, -1)] = 0

for i in range(0, len(self._seq1)):
    for j in range(0, len(self._seq2)):
        cost = self.get_distance(i, j)
        self.distance_map[(i, j)] = cost
        self._map[(i, j)] = cost + min(self._map[(i-1, j)],    # insertion
                                       self._map[(i, j-1)],    # deletion
                                       self._map[(i-1, j-1)])  # match
return self._map

(see also m-toman/AMTV @ GitHub)

And it worked like a charm.
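For completeness: one could also keep the recursive version alive longer by raising CPython’s recursion limit. I’d treat that as a stopgap rather than a fix, since it only moves the cliff – and a limit beyond what the underlying C stack can hold may crash the interpreter outright:

import sys

# stopgap, not a fix: this only postpones the RecursionError,
# and going too far can segfault the interpreter itself
sys.setrecursionlimit(50_000)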

Let’s dig deeper…

Toy example in C++

I came up with a slightly simpler example so we have an easier time analyzing it.

The following program fills an array with each element set to its index times two and also calculates the sum of all elements (don’t ask…). We repeat this a couple of times so we can better compare the runtime costs.

So let’s look at an iterative and a recursive implementation:

// iterative.cpp
int main() {
    const int max_arr_len = 100000;
    const int max_runs = 10000;
    int arr[max_arr_len];
    for (int runs = 0; runs < max_runs; ++runs) {
        long sum = 0L;
        for (int i = 0; i < max_arr_len; ++i) {
            arr[i] = i + i;
            sum += (long)arr[i];
        }
    }
}

// recursive.cpp
long recur(int* arr, int idx, long sum, int max) {
    if (idx >= max) {
        return sum;
    }
    else {
        arr[idx] = idx + idx;
        return recur(arr, idx + 1, sum + (long)arr[idx], max);
    }
}

int main() {
    const int max_arr_len = 100000;
    const int max_runs = 10000;
    int arr[max_arr_len];
    for (int runs = 0; runs < max_runs; ++runs) {
        long sum = recur(arr, 0, 0L, max_arr_len);
    }
}

Now we’ll time the run:

$ g++ recursive.cpp ; time ./a.out
real	0m17.732s
user	0m16.236s
sys	0m0.139s

$ g++ iterative.cpp ; time ./a.out
real	0m3.305s
user	0m3.254s
sys	0m0.016s

Wow, the iterative version is more than 5x faster.

Again, let’s dig a bit deeper.

Toy example in Assembly

Let’s first look at the (unoptimized) assembly of the recursive version. I’ve thrown out the non-essential parts and added a couple of comments:

## recursive.s
__Z5recurPiili:                         ## @_Z5recurPiili
## save the stack pointer, reserve 48 bytes for this call:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $48, %rsp

## ...
## go into the recursion part:
        cmpl    -36(%rbp), %ecx
        jl      LBB0_2

## ...
## otherwise leave:
        jmp     LBB0_3

## do the calculations and then call the recursion:
LBB0_2:
        movl    -20(%rbp), %eax
        addl    -20(%rbp), %eax
        movq    -16(%rbp), %rcx
        movslq  -20(%rbp), %rdx
        movl    %eax, (%rcx,%rdx,4)
        movq    -16(%rbp), %rdi
        movl    -20(%rbp), %eax
        addl    $1, %eax
        movq    -32(%rbp), %rcx
        movq    -16(%rbp), %rdx
        movslq  -20(%rbp), %rsi
        movslq  (%rdx,%rsi,4), %rdx
        addq    %rdx, %rcx
        movl    -36(%rbp), %r8d
        movl    %eax, %esi
        movq    %rcx, %rdx
        movl    %r8d, %ecx
        callq   __Z5recurPiili      ## <----

## return
LBB0_3:
        movq    -8(%rbp), %rax
        addq    $48, %rsp
        popq    %rbp
        retq

Each call to __Z5recurPiili first advances the stack pointer by 48 bytes (subq $48, %rsp) to hold that call’s local variables. On top of that, additional space is reserved for the return address etc. (see CALL – Call Procedure for more details). This means the stack usage grows linearly with the recursion depth, as we gradually touch more and more virtual address space.
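A rough back-of-the-envelope estimate (assuming the 48 bytes of locals plus 8 bytes each for the return address and the saved %rbp, so about 64 bytes per frame): at our recursion depth of 100,000 that is already around 6.4 MB of stack, uncomfortably close to the typical 8 MB default limit.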

Now the iterative version. To make it a bit easier to compare, I’ve pulled the inner loop out into a function:

// iterative_inner.cpp
long iterate(int* arr, int len) {
    long sum = 0;
    for (int i = 0; i < len; ++i) {
        arr[i] = i + i;
        sum += (long)arr[i];
    }
    return sum;
}
// snip
//snip

And the resulting assembly, again pruned to the essentials, with comments added:

## iterative_inner.s
__Z7iteratePii:                         ## @_Z7iteratePii
## save the stack pointer again
        pushq   %rbp
        movq    %rsp, %rbp

## ...
## check the loop condition and leave if done
LBB0_1:
        movl    -28(%rbp), %eax
        cmpl    -12(%rbp), %eax
        jge     LBB0_4

## otherwise do our calculations
        movl    -28(%rbp), %eax
        addl    -28(%rbp), %eax
        movq    -8(%rbp), %rcx
        movslq  -28(%rbp), %rdx
        movl    %eax, (%rcx,%rdx,4)
        movq    -8(%rbp), %rcx
        movslq  -28(%rbp), %rdx
        movslq  (%rcx,%rdx,4), %rcx
        addq    -24(%rbp), %rcx
        movq    %rcx, -24(%rbp)
        movl    -28(%rbp), %eax
        addl    $1, %eax
        movl    %eax, -28(%rbp)
## now continue the loop
        jmp     LBB0_1
LBB0_4:
        movq    -24(%rbp), %rax
        popq    %rbp
        retq

The function reserves stack space only once, for the single call; beyond that we don’t touch any further virtual address space.

To sum up, the iterative version is not only more than 5x faster, it also has constant memory requirements, no matter how many iterations we run.
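As an aside, the ceiling the recursive version keeps creeping towards can be inspected from the shell on Unix-like systems:

$ ulimit -s     # maximum stack size in KB; 8192, i.e. 8 MB, is a common default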

But this is without optimization…

Toy example in Assembly, optimized

Now we activate the compiler optimizations and recompile:

$ g++ -O3 iterative.cpp ; time ./a.out
real	0m0.005s
user	0m0.001s
sys	0m0.002s

$ g++ -O3 recursive.cpp ; time ./a.out
real	0m0.004s
user	0m0.001s
sys	0m0.002s

Oh, wait, what happened here?
It seems the compiler pruned a bit too much: since we never use any of the results, it optimized the whole computation away.

So let’s modify the code a bit:

// iterative.cpp
#include <iostream>

int main() {
    const int max_arr_len = 100000;
    const int max_runs = 1000000;
    int arr[max_arr_len];
    long sumsum = 0;
    long randsum = 0;
    for (int runs = 0; runs < max_runs; ++runs) {
        long sum = 0;
        for (int i = 0; i < max_arr_len; ++i) {
            arr[i] = i + i;
            sum += (long)arr[i];
        }
        sumsum += sum;
        randsum += arr[100];
    }
    std::cout << sumsum << std::endl;
    std::cout << randsum << std::endl;
}

// recursive.cpp
#include <iostream>

long recur(int* arr, int idx, long sum, int max) {
    if (idx >= max) {
        return sum;
    }
    else {
        arr[idx] = idx + idx;
        return recur(arr, idx + 1, sum + (long)arr[idx], max);
    }
}

int main() {
    const int max_arr_len = 100000;
    const int max_runs = 1000000;
    int arr[max_arr_len];
    long sumsum = 0;
    long randsum = 0;
    for (int runs = 0; runs < max_runs; ++runs) {
        long sum = recur(arr, 0, 0L, max_arr_len);
        sumsum += sum;
        randsum += arr[100];
    }
    std::cout << sumsum << std::endl;
    std::cout << randsum << std::endl;
}

I’ve also pumped up the number of runs from 10,000 to 1,000,000 for the optimized version.
So now we can run it again:

$ g++ -O3 iterative.cpp ; time ./a.out
9999900000000000
200000000
real	0m21.900s
user	0m21.127s
sys	0m0.098s

$ g++ -O3 recursive.cpp ; time ./a.out
9999900000000000
200000000
real	0m23.792s
user	0m21.582s
sys	0m0.172s

Neat, not much of a difference now.

So it seems the compiler was able to deal with the recursion: the call at the end of recur is a tail call, so with optimizations enabled the compiler can replace it with a plain jump and effectively generate the same loop as the iterative version.
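If you want to check this yourself (a sketch assuming a similar toolchain, not verified output), emit the optimized assembly and look at whether recur still calls itself:

$ g++ -O3 -S recursive.cpp -o recursive_opt.s
$ grep callq recursive_opt.s    # if TCO kicked in, no self-call to __Z5recurPiili remains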

Summary

What we saw in this example is that functional-language compilers are not the only ones that can deal with recursion. But the Python example also showed that we can’t necessarily rely on it, especially in dynamic and interpreted languages.

So, coming back to my original statement: the iterative version of an algorithm is usually the safer bet in imperative languages. This is partly due to the inherent immutability of recursion: we don’t mutate the index and the sum, so without TCO a copy of each has to be stored for every call. On the other hand, that same aspect makes parallelization much easier – but we’ll deal with that topic another time.

But as usual: “Measure, measure, measure!” 🙂