I’ve recently answered a question on the German Quora (https://de.quora.com/Warum-sollte-man-in-Java-lieber-auf-die-Rekursion-verzichten/answer/Markus-Toman) that stated: “Why should you refrain from recursion in Java?”.

I think we should avoid recursion by default, except if:

- you are working in a language where recursion is essential and highly-optimized (Haskell, LISP dialects, Erlang/elixir and most other functional languages)
- the recursive version is much easier to maintain and understand AND does not negatively affect performance to an unreasonable degree
- the recursive algorithm is easier parallelizable (typically because of immutability)

In this post we discuss this topic, starting with high-level Python, going over to C++ and finally ending up in assembly.

Especially when a language does not support Tail call optimization (TCO – see also Tail Recursion – GeeksforGeeks) we have to worry about memory usage and performance and in the worst case a stack overflow. While recursion can often be beautiful and represent a solution clearer, many languages are not well-designed to handle them. So of course, if you’re a functional programmer – have fun and recurse you brain out. If you’re dealing with a language like Python, be a bit more careful.

Especially if you’re dealing with high recursion depth or many parallel tasks performing recursion.

Let’s start at a high level

## Dynamic time warping in Python

A couple years ago I used an existing, recursive implementation of Dynamic time warping (DTW) with cross-entropy to map sequences of probability distributions.

The relevant part is shown here:

def calculate_backward(self, i1, i2): ''' Calculate the dtw distance between seq1[:i1 + 1] and seq2[:i2 + 1] ''' if self._map.get((i1, i2)) is not None: return self._map[(i1, i2)] if i1 == -1 or i2 == -1: self._map[(i1, i2)] = float('inf') return float('inf') min_i1, min_i2 = min((i1 - 1, i2), (i1, i2 - 1), (i1 - 1, i2 - 1), key=lambda x: self.calculate_backward(*x) ) self._map[(i1, i2)] = self.get_distance(i1, i2) + self.calculate_backward(min_i1, min_i2) return self._map[(i1, i2)]

(see also m-toman/AMTV @ github)

While I wasn’t dealing with very long sequences, this implementation blew up pretty quickly with a stack overflow.

So I implemented the iterative version as follows:

for i in range(0, len(self._seq1)): self._map[(i,-1)] = float('inf') for i in range(0, len(self._seq2)): self._map[(-1,i)] = float('inf') self._map[(-1,-1)] = 0 for i in range(0, len(self._seq1)): for j in range(0, len(self._seq2)): cost = self.get_distance(i, j) self.distance_map[(i,j)] = cost self._map[(i,j)] = cost + min(self._map[ i-1 , j ], # insertion self._map[ i , j-1 ], # deletion self._map[ i-1 , j-1 ] # match ) return self._map

And it worked like a charm.

Let’s dig deeper…

## Toy example in C++

I came up with a slightly simpler example so we have an easier time analyzing it.

The following program fills an array with its index times two and also calculates the sum of it (don’t ask…). We repeat this a couple of times so we can better compare the runtime costs.

So let’s look at an iterative and a recursive implementation:

// iterative.cpp int main() { const int max_arr_len = 100000; const int max_runs = 10000; int arr[max_arr_len]; for (int runs = 0; runs < max_runs; ++runs) { long sum = 0L; for(int i = 0; i < max_arr_len; ++i) { arr[i] = i + i; sum += (long)arr[i]; } } } // recursive.cpp long recur(int* arr, int idx, long sum, int max) { if(idx >= max) { return sum; } else { arr[idx] = idx + idx; return recur(arr, idx+1, sum+(long)arr[idx], max); } } int main() { const int max_arr_len = 100000; const int max_runs = 10000; int arr[max_arr_len]; for (int runs = 0; runs < max_runs; ++runs) { long sum = recur(arr, 0, 0L, max_arr_len); } }

Now we’ll time the run:

$ g++ recursive.cpp ; time ./a.out real 0m17.732s user 0m16.236s sys 0m0.139s $ g++ iterative.cpp ; time ./a.out real 0m3.305s user 0m3.254s sys 0m0.016s

Wow, the iterative version is more than 5x faster.

Again, let’s dig a bit deeper.

## Toy example in Assembly

Let’s first look at the (unoptimized) assembly of the recursive version. I’ve thrown out non-essential parts and added a couple comments:

## recursive.s __Z5recurPiili: ## @_Z5recurPiili ## Saving stackpointer, reserve 48 bytes for this call: pushq %rbp movq %rsp, %rbp subq $48, %rsp ## ... ## go into the recursion part: cmpl -36(%rbp), %ecx jl LBB0_2 ## ... ## otherwise leave: jmp LBB0_3 ## do calculations and then call the recursion: LBB0_2: movl -20(%rbp), %eax addl -20(%rbp), %eax movq -16(%rbp), %rcx movslq -20(%rbp), %rdx movl %eax, (%rcx,%rdx,4) movq -16(%rbp), %rdi movl -20(%rbp), %eax addl $1, %eax movq -32(%rbp), %rcx movq -16(%rbp), %rdx movslq -20(%rbp), %rsi movslq (%rdx,%rsi,4), %rdx addq %rdx, %rcx movl -36(%rbp), %r8d movl %eax, %esi movq %rcx, %rdx movl %r8d, %ecx callq __Z5recurPiili ## <---- ## return LBB0_3: movq -8(%rbp), %rax addq $48, %rsp popq %rbp retq

“callq __Z5recurPiili” first advances the stack pointer with each call for 48 bytes (subq $48, %rsp) to hold the states of the local variables. Also, additional space will be reserved for the return addresses etc. (see CALL – Call Procedure for more details). This means the need for stack increases linearly with recursion depth as we gradually access more virtual address space.

Now the iterative version. To make the a bit easier to compare, I’ve pulled out the inner loop into a function:

// iterative_inner.cpp long iterate(int* arr, int len) { long sum = 0; for(int i = 0; i < len; ++i) { arr[i] = i + i; sum += (long)arr[i]; } return sum; } //snip

And the resulting assembly, again pruned to the essentials and comments added:

## iterative_inner.s __Z7iteratePii: ## @_Z7iteratePii ## save the stack pointer again pushq %rbp movq %rsp, %rbp ## ... ## check the loop condition and leave if done LBB0_1: movl -28(%rbp), %eax cmpl -12(%rbp), %eax jge LBB0_4 ## otherwise do our calculations movl -28(%rbp), %eax addl -28(%rbp), %eax movq -8(%rbp), %rcx movslq -28(%rbp), %rdx movl %eax, (%rcx,%rdx,4) movq -8(%rbp), %rcx movslq -28(%rbp), %rdx movslq (%rcx,%rdx,4), %rcx addq -24(%rbp), %rcx movq %rcx, -24(%rbp) movl -28(%rbp), %eax addl $1, %eax movl %eax, -28(%rbp) ## now continue the loop jmp LBB0_1 LBB0_4: movq -24(%rbp), %rax popq %rbp retq

The function only once reservers the stack for the function call, otherwise we don’t access any further virtual address space.

To sum up, the iterative version is not only more than 5x faster, it also has constant memory requirements, no matter how many iterations.

But this is without optimization…

## Toy example in Assembly, optimized

Now we activate the compiler optimizations and recompile:

$ g++ -O3 iterative.cpp ; time ./a.out real 0m0.005s user 0m0.001s sys 0m0.002s $ g++ -O3 recursive.cpp ; time ./a.out real 0m0.004s user 0m0.001s sys 0m0.002s

Oh, wait, what happened here?

Seems the compiler pruned a bit too much, as we don’t seem to use any of the results.

So let’s do modify the code a bit:

// iterative.cpp #include <iostream> int main() { const int max_arr_len = 100000; const int max_runs = 1000000; int arr[max_arr_len]; long sumsum = 0; long randsum = 0; for (int runs = 0; runs < max_runs; ++runs) { long sum = 0; for(int i = 0; i < max_arr_len; ++i) { arr[i] = i + i; sum += (long)arr[i]; } sumsum += sum; randsum += arr[100]; } std::cout << sumsum << std::endl; std::cout << randsum << std::endl; } // recursive.cpp #include <iostream> long recur(int* arr, int idx, long sum, int max) { if(idx >= max) { return sum; } else { arr[idx] = idx + idx; return recur(arr, idx+1, sum+(long)arr[idx], max); } } int main() { const int max_arr_len = 100000; const int max_runs = 1000000; int arr[max_arr_len]; long sumsum = 0; long randsum = 0; for (int runs = 0; runs < max_runs; ++runs) { long sum = recur(arr, 0, 0L, max_arr_len); sumsum += sum; randsum += arr[100]; } std::cout << sumsum << std::endl; std::cout << randsum << std::endl; }

I’ve also pumped up the number of iterations a bit for the optimized version.

So now we can run it again:

$ g++ -O3 iterative.cpp ; time ./a.out 9999900000000000 200000000 real 0m21.900s user 0m21.127s sys 0m0.098s $ g++ -O3 recursive.cpp ; time ./a.out 9999900000000000 200000000 real 0m23.792s user 0m21.582s sys 0m0.172s

Neat, not much of a difference now.

So it seems the compiler was able to deal with the recursion.

## Summary

What we saw in this example is that not only functional language compilers are able to deal with recursion, but we also saw in the Python example that we can’t necessarily rely on that. Especially in dynamic and interpreted languages.

So coming back to my original statement: the iterative version of an algorithm is usually the safer bet in imperative languages. This is partially because of the inherent immutability of recursion, i.e. we don’t mutate the index and the sum and without TCO we have to store all their copies. On the other hand, this aspect makes parallelization much easier, but we will deal with this topic another time.

But as usual: “Measure, measure, measure!” ðŸ™‚

## Recent Comments